ROBUST RECOVERY OF MULTIPLE SUBSPACES BY
GEOMETRIC $l_p$ MINIMIZATION

By Gilad Lerman and Teng Zhang 

University of Minnesota 

We assume i.i.d. data sampled from a mixture distribution with $K$ components along fixed $d$-dimensional linear subspaces and an additional outlier component. For $p > 0$, we study the simultaneous recovery of the $K$ fixed subspaces by minimizing the $l_p$-averaged distances of the sampled data points from any $K$ subspaces. Under some conditions, we show that if $0 < p \le 1$, then all underlying subspaces can be precisely recovered by $l_p$ minimization with overwhelming probability. On the other hand, if $K > 1$ and $p > 1$, then the underlying subspaces cannot be recovered or even nearly recovered by $l_p$ minimization. The results of this paper partially explain the successes and failures of the basic approach of $l_p$ energy minimization for modeling data by multiple subspaces.

1. Introduction. In the last decade, many algorithms have been devel- 
oped to model data by multiple subspaces. Such hybrid linear modeling 
(HLM) was motivated by concrete problems in computer vision as well as 
by nonlinear dimensionality reduction. HLM is the simplest geometric frame- 
work for nonlinear dimensionality reduction. Nevertheless, very little theory 
has been developed to justify the performance of existing methods. Here we 
give a rigorous analysis of the recovery of multiple subspaces via an energy 
minimization. 

One can model a data set $X$ with $K$ subspaces obtained by minimizing the following energy over the subspaces $L_1, \ldots, L_K$:

(1)  $e_{l_p}(X, L_1, \ldots, L_K) = \sum_{x \in X} \min_{1 \le i \le K} \operatorname{dist}^p(x, L_i),$

where $\operatorname{dist}(\cdot,\cdot)$ denotes the Euclidean distance and $p > 0$ is a fixed parameter. For simplicity, we assume that $L_1, \ldots, L_K$ are linear subspaces of the same dimension $d$, and we refer to them as $d$-subspaces (generalizations are discussed in Sections 5.6 and 5.7). We also assume that the data set $X$ contains i.i.d. samples from a mixture distribution $\mu$, with $K$ components along fixed $d$-subspaces and an additional outlier component. The recovery problem asks whether with overwhelming probability the minimization of (1) recovers the underlying subspaces of $\mu$. We show here that when $0 < p \le 1$ the answer to this problem is positive, whereas when $p > 1$ it is negative.
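As a concrete reading of the energy (1), the following minimal sketch evaluates it with NumPy. It assumes each subspace is represented by a $D \times d$ matrix with orthonormal columns; the function names `dist_to_subspace` and `lp_energy` are illustrative and do not come from the paper.

```python
import numpy as np

def dist_to_subspace(X, B):
    """Euclidean distances from the rows of X (N x D) to span(B).

    B is a D x d matrix with orthonormal columns (an assumed representation).
    """
    residual = X - (X @ B) @ B.T  # component of each point orthogonal to span(B)
    return np.linalg.norm(residual, axis=1)

def lp_energy(X, bases, p):
    """Energy (1): sum over points of dist(x, nearest subspace)**p."""
    dists = np.stack([dist_to_subspace(X, B) for B in bases], axis=1)  # N x K
    return float(np.sum(dists.min(axis=1) ** p))

# Toy example: the two coordinate axes in R^2.
e1 = np.array([[1.0], [0.0]])
e2 = np.array([[0.0], [1.0]])
X = np.array([[2.0, 0.0], [0.0, -3.0], [1.0, 1.0]])
energy = lp_energy(X, [e1, e2], p=1.0)
```

Only the point (1, 1) lies off both axes, so it alone contributes to the energy in this toy example.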

Recovery problems are common in statistics, for example, recovering a single subspace in least squares type problems or recovering multiple centers as in $K$-means. However, our recent setting requires novel developments.
One issue is the strong geometric nature of our problem, resulting from an 
optimization on a product space of Grassmannians. The other is the diffi- 
culty of approximating the problem by convex optimization (as we clarify 
in Section 5.1). Thus, even though it is an elementary problem in statistical 
learning, it requires the development of techniques which are currently not 
widely common in statistics. 

1.1. Background and related work. Many algorithms have been devel- 
oped for HLM (see, e.g., [1, 5, 8-11, 13, 14, 20-26]), and they find diverse 
applications in several areas, such as motion segmentation in computer vi- 
sion, hybrid linear representation of images, classification of face images and 
temporal segmentation of video sequences (see, e.g., [14, 23, 26]). HLM is 
the simplest nonlinear data modeling and fits within the broader frameworks 
of modeling data by mixture of manifolds [3] and by Whitney's stratified 
space [4]. 

The $K$-subspaces algorithm [5, 10, 22] is the most basic heuristic for HLM, and it suggests an iterative procedure attempting to minimize the energy (1) with $p = 2$. It generalizes the $K$-means algorithm, which models data by $K$ centers, that is, 0-dimensional affine subspaces. Numerical experiments by Zhang et al. [25] have shown that the $K$-subspaces algorithm is in general not robust to outliers, whereas a different method aiming to minimize (1) with $p = 1$ seems to be robust to outliers.
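The iterative procedure mentioned above can be sketched as follows. This is a minimal illustration of a $K$-subspaces style alternation for linear subspaces with $p = 2$ (assign each point to its nearest subspace, then refit each subspace by the top $d$ eigenvectors of the cluster's second-moment matrix). It is not the implementation of [5, 10, 22]; the name `k_subspaces`, the random initialization and the handling of empty clusters are assumptions made for the example.

```python
import numpy as np

def sq_dists(X, bases):
    """N x K matrix of squared distances from rows of X to each span(B)."""
    return np.stack(
        [np.sum((X - (X @ B) @ B.T) ** 2, axis=1) for B in bases], axis=1
    )

def k_subspaces(X, K, d, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    # Random orthonormal initial bases (QR factorization of Gaussian matrices).
    bases = [np.linalg.qr(rng.standard_normal((D, d)))[0] for _ in range(K)]
    energies = []
    for _ in range(n_iter):
        d2 = sq_dists(X, bases)
        labels = d2.argmin(axis=1)              # assignment step
        energies.append(float(d2.min(axis=1).sum()))  # l_2 energy before refit
        for k in range(K):
            Xk = X[labels == k]
            if len(Xk) == 0:
                continue                        # keep the old basis if a cluster is empty
            # Best-fit linear d-subspace in the least squares sense:
            # top-d eigenvectors of Xk^T Xk (eigh sorts eigenvalues ascending).
            _, vecs = np.linalg.eigh(Xk.T @ Xk)
            bases[k] = vecs[:, -d:]
    return bases, labels, energies

# Noiseless toy data on two lines in R^3.
rng = np.random.default_rng(1)
t = rng.uniform(-1, 1, size=(100, 1))
X = np.vstack([t[:50] * [1.0, 0.0, 0.0], t[50:] * [0.0, 1.0, 0.0]])
bases, labels, energies = k_subspaces(X, K=2, d=1)
```

Each of the two steps cannot increase the $l_2$ energy, so the recorded `energies` are non-increasing; the procedure may still stop at a local minimum, which is one reason it is only a heuristic.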

There has been little investigation into performance guarantees of the various HLM algorithms. Nevertheless, the accuracy of segmentation under some sampling assumptions was analyzed for two spectral-type HLM algorithms in [7] and [3], where [3] also quantified the tolerance to outliers ([3] considers only the asymptotic case, though applies to modeling by multiple manifolds). For the $K$-means algorithm (which only applies to 0-dimensional affine subspaces), Pollard has established strong consistency [16] and a central limit theorem [17].

In [12], we analyzed the $l_p$-recovery of the "most significant" subspace among multiple subspaces and outliers with spherically symmetric underlying distributions. We assume here a similar (though weaker) underlying model and rely on some of the estimates already developed there.

1.2. Basic conventions and notation. We denote by $G(D,d)$ the Grassmannian, that is, the manifold of $d$-subspaces of $\mathbb{R}^D$. We measure distances between $F$ and $G$ in $G(D,d)$ by the metric

(2)  $\operatorname{dist}_G(F, G) = \sqrt{\sum_{i=1}^d \theta_i^2},$

where $\{\theta_i\}_{i=1}^d$ are the principal angles between $F$ and $G$. We use this distance since there is a simple formula for the geodesic lines on the Grassmannian equipped with this distance (see, e.g., [12], equation 12), which is applied in this paper. We distinguish elements in the $K$-fold product space $G(D,d)^K$ by the $l_\infty$ norm, that is,

(3)  $\operatorname{dist}_{G^K}((L_1, \ldots, L_K), (\hat L_1, \ldots, \hat L_K)) = \max_{i=1,\ldots,K} (\operatorname{dist}_G(L_i, \hat L_i)).$

Following [15], Section 3.9, we denote by $\gamma_{D,d}$ the "uniform" distribution on $G(D,d)$.

We denote by $a \vee b$ and $a \wedge b$ the maximum and minimum of $a$ and $b$, respectively. We designate the support of a distribution $\mu$ by $\operatorname{supp}(\mu)$. By saying "with overwhelming probability" or, in short, "w.o.p.," we mean that the underlying probability is at least $1 - Ce^{-N/C}$, where $C$ is a constant independent of $N$.

1.3. Setting of this paper. We assume an i.i.d. data set $X \subseteq \mathbb{R}^D$ of size $N$ sampled from a mixture distribution representing a hybrid linear model around distinct $d$-subspaces, $\{L_i^*\}_{i=1}^K$. We in fact consider two different types of models, but both of them have the same basic structure.

We assume $K$ distributions $\{\mu_i\}_{i=1}^K$, each supported on a corresponding and distinct $d$-subspace, $L_i^*$, a noise level $\varepsilon \ge 0$, and an outlier distribution, denoted by $\mu_0$. Furthermore, for each $1 \le i \le K$ we have a distinct noise distribution $\nu_{i,\varepsilon}$ with bounded support in the orthogonal complement $L_i^{*\perp}$. We assume that the $p$th moments of $\{\|\nu_{i,\varepsilon}\|\}_{i=1}^K$ are smaller than $\varepsilon^p$ for all $0 < p \le 1$ ($p < 1$ is only needed when we consider $l_p$ minimization with $p < 1$). Moreover, if $\varepsilon = 0$, then $\{\nu_{i,0}\}_{i=1}^K$ are the Dirac $\delta$ distributions supported on the origin within the corresponding subspaces orthogonal to $\{L_i^*\}_{i=1}^K$.

We assume that the underlying distributions, $\{\mu_i\}_{i=0}^K$, have bounded supports (or possibly sub-Gaussian as explained in Section 5.3). In order to simplify our estimates, we further assume that $\operatorname{supp}(\mu_i) \subseteq B(0,1)$ for $0 \le i \le K$.
From these pieces we construct the mixture distribution $\mu_\varepsilon$,

(4)  $\mu_\varepsilon = \alpha_0 \mu_0 + \sum_{i=1}^K \alpha_i \, \mu_i \times \nu_{i,\varepsilon},$

where $\alpha_0 \ge 0$, $\alpha_i > 0$ $\forall 1 \le i \le K$ and $\sum_{i=0}^K \alpha_i = 1$. If $\varepsilon = 0$, then for convenience we replace the notation $\mu_\varepsilon$ by $\mu$, that is,

(5)  $\mu = \alpha_0 \mu_0 + \sum_{i=1}^K \alpha_i \mu_i.$

Within this basic framework, we analyze two different models. For $\varepsilon \ge 0$ and $\mu_\varepsilon$ as in (4), we say that $\mu_\varepsilon$ is a weakly spherically symmetric HLM distribution with noise level $\varepsilon$ if the $\{\mu_i\}_{i=1}^K$ are generated by rotations (in $\mathbb{R}^D$) of a single distribution $\hat\mu$, such that $\hat\mu(\{0\}) < 1$, $\operatorname{supp}(\hat\mu) \subseteq B(0,1) \cap \hat L$ for some $d$-subspace $\hat L \subseteq \mathbb{R}^D$ and $\hat\mu$ is spherically symmetric within $\hat L$ (i.e., invariant to rotations within $\hat L$).

Our second model has weaker assumptions on the distributions of inliers and a slightly stronger assumption on the distribution of outliers. For $\varepsilon \ge 0$ and $\mu_\varepsilon$ as in (4), we say that $\mu_\varepsilon$ is a weak HLM distribution with noise level $\varepsilon$ if $\mu_i(\{0\}) < 1$ $\forall 1 \le i \le K$, $\operatorname{supp}(\mu_\varepsilon) \subseteq B(0,1)$ and for some $r > 0$ the uniform distribution on $B(0,r)$ is absolutely continuous w.r.t. the restriction of $\mu_0$ to $B(0,r)$.

Our theory uses the constant $\tau_0 = \tau_0(d, p, \{\mu_i\}_{i=1}^K)$. We delay its definition to the proofs [see (11)], but use it in the formulation of Theorems 1.1 and 1.2.

1.4. Statistical problems of this paper. We address here two statistical problems. The simpler one is implicit in this introduction, though clear from the proofs. It asks whether the underlying subspaces $\{L_i^*\}_{i=1}^K$ can be recovered when $\varepsilon = 0$ by minimizing $E_\mu(\operatorname{dist}^p(x, \bigcup_{i=1}^K L_i))$ over $\{L_i\}_{i=1}^K \subseteq G(D,d)$. The main problem can be formulated using the empirical distribution $\mu_N$ of an i.i.d. sample of size $N$ from $\mu$. It asks whether $\{L_i^*\}_{i=1}^K$ can be recovered (w.o.p.) by minimizing $E_{\mu_N}(\operatorname{dist}^p(x, \bigcup_{i=1}^K L_i))$, which is equivalent to minimizing (1). In the noisy case, we extend these problems to near recovery. When $K > 1$ and $d \ge 1$, these problems are nontrivial and require complicated geometric estimates.

1.5. Main theory. We first formulate the exact recovery of $\{L_i^*\}_{i=1}^K$ as the unique global minimizer of the $l_p$ energy (1) when $0 < p \le 1$.



Theorem 1.1. Assume that $\mu$ is a weakly spherically symmetric HLM distribution on $\mathbb{R}^D$ without noise ($\varepsilon = 0$) and with underlying subspaces $\{L_i^*\}_{i=1}^K \subseteq \mathbb{R}^D$ and mixture coefficients $\{\alpha_i\}_{i=0}^K$. Let $X$ be an i.i.d. data set sampled from $\mu$. If $0 < p \le 1$ and

(6)  $\alpha_0 < \tau_0 \cdot \min_{i=1,\ldots,K} \alpha_i \cdot \Bigl(1 \wedge \min_{1 \le i < j \le K} \operatorname{dist}_G(L_i^*, L_j^*)^p / 2^p\Bigr),$

then w.o.p. the set $\{L_1^*, \ldots, L_K^*\}$ is the unique global minimizer of the energy (1) among all $d$-subspaces in $\mathbb{R}^D$.
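To experiment with the noiseless setting of Theorem 1.1, one can draw synthetic data and evaluate the energy (1) at different candidate subspaces. The sketch below is one possible instance of a weakly spherically symmetric model, under assumptions that are not taken from the paper: inliers are uniform on the unit ball of each subspace, outliers are uniform on the unit ball of $\mathbb{R}^D$, and the subspaces are spans of Gaussian matrices. All function names are illustrative.

```python
import numpy as np

def random_basis(D, d, rng):
    """Orthonormal basis (D x d) of a random d-subspace via QR of a Gaussian matrix."""
    return np.linalg.qr(rng.standard_normal((D, d)))[0]

def uniform_ball(n, dim, rng):
    """n points drawn uniformly from the unit ball of R^dim."""
    g = rng.standard_normal((n, dim))
    g /= np.linalg.norm(g, axis=1, keepdims=True)       # uniform directions
    return g * rng.uniform(size=(n, 1)) ** (1.0 / dim)   # radii with density ~ r^(dim-1)

def sample_hlm(N, D, d, alphas, rng):
    """alphas = (alpha_0, ..., alpha_K). Returns data, labels (0 = outlier), bases."""
    K = len(alphas) - 1
    bases = [random_basis(D, d, rng) for _ in range(K)]
    labels = rng.choice(K + 1, size=N, p=alphas)
    X = np.empty((N, D))
    for k in range(K + 1):
        idx = np.flatnonzero(labels == k)
        if k == 0:
            X[idx] = uniform_ball(len(idx), D, rng)                    # outliers
        else:
            X[idx] = uniform_ball(len(idx), d, rng) @ bases[k - 1].T   # inliers on L_k
    return X, labels, bases

def lp_energy(X, bases, p):
    """Energy (1) for subspaces given by orthonormal bases."""
    dists = np.stack(
        [np.linalg.norm(X - (X @ B) @ B.T, axis=1) for B in bases], axis=1
    )
    return float(np.sum(dists.min(axis=1) ** p))

rng = np.random.default_rng(0)
X, labels, true_bases = sample_hlm(N=2000, D=5, d=2, alphas=(0.1, 0.45, 0.45), rng=rng)
other_bases = [random_basis(5, 2, rng) for _ in range(2)]
print("energy at underlying subspaces:", lp_energy(X, true_bases, p=1.0))
print("energy at random subspaces:    ", lp_energy(X, other_bases, p=1.0))
```

At the underlying subspaces only the outliers contribute to the energy, which is the mechanism behind condition (6): the outlier fraction $\alpha_0$ must be small relative to the inlier fractions.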

Theorem 1.1 extends to the noisy case by allowing near-recovery as follows (a counterexample for asymptotic exact recovery is shown in Section 3.2).

Theorem 1.2. Assume that $\varepsilon > 0$ and $\mu_\varepsilon$ is a weakly spherically symmetric HLM distribution of noise level $\varepsilon$ on $\mathbb{R}^D$ with $K$ $d$-subspaces $\{L_i^*\}_{i=1}^K \subseteq \mathbb{R}^D$ and mixture coefficients $\{\alpha_i\}_{i=0}^K$. Let $X$ be an i.i.d. data set sampled from $\mu_\varepsilon$. If $0 < p \le 1$ and

(7)  $\varepsilon < 3^{-1/p} \Bigl(\tau_0 \cdot \min_{i=1,\ldots,K} \alpha_i \cdot \Bigl(1 \wedge \min_{1 \le i < j \le K} \operatorname{dist}_G(L_i^*, L_j^*)^p / 2^p\Bigr) - \alpha_0\Bigr)^{1/p},$

then any minimizer of (1) in $G(D,d)^K$ has a distance smaller than

(8)  $f = f(\varepsilon, K, d, p, \{\alpha_i\}_{i=1}^K) = 3^{1/p} \cdot \Bigl(\tau_0 \min_{1 \le j \le K} \alpha_j - \alpha_0\Bigr)^{-1/p} \cdot \varepsilon$

from one of the permutations of $(L_1^*, \ldots, L_K^*)$ with overwhelming probability.
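The two quantities in Theorem 1.2 are simple closed-form expressions, and a small calculator can make their dependence on the parameters easier to inspect. The sketch below assumes the readings $\varepsilon < 3^{-1/p}(\cdot)^{1/p}$ for (7) and $f = 3^{1/p}(\tau_0 \min_j \alpha_j - \alpha_0)^{-1/p}\varepsilon$ for (8); the function names and the example numbers are hypothetical.

```python
def separation_factor(min_dist, p):
    """The term 1 ∧ (min_{i<j} dist_G(L_i*, L_j*) / 2)^p appearing in (6) and (7)."""
    return min(1.0, (min_dist / 2.0) ** p)

def noise_threshold(tau0, alpha_min, alpha0, min_dist, p):
    """Right-hand side of (7): the largest noise level covered by Theorem 1.2."""
    gap = tau0 * alpha_min * separation_factor(min_dist, p) - alpha0
    if gap <= 0:
        raise ValueError("condition (6) fails: the outlier fraction is too large")
    return (gap / 3.0) ** (1.0 / p)

def recovery_radius(eps, tau0, alpha_min, alpha0, p):
    """The bound f of (8) on the distance of a minimizer from the true subspaces."""
    gap = tau0 * alpha_min - alpha0
    if gap <= 0:
        raise ValueError("tau0 * alpha_min must exceed alpha0")
    return (3.0 / gap) ** (1.0 / p) * eps

# Hypothetical numbers, chosen only to exercise the formulas.
eps_max = noise_threshold(tau0=0.5, alpha_min=0.4, alpha0=0.05, min_dist=2.0, p=1.0)
f = recovery_radius(eps=0.01, tau0=0.5, alpha_min=0.4, alpha0=0.05, p=1.0)
```

Note that $f$ grows linearly in $\varepsilon$, so the guarantee degrades gracefully with the noise level as long as (7) holds.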

At last, we formulate the impossibility to recover $\{L_i^*\}_{i=1}^K$ by $l_p$ minimization when $p > 1$ (the constants $\delta_0$ and $\kappa_0$ in our formulation are estimated in Section 4.5.5).

Theorem 1.3. Assume an i.i.d. sample of $K$ $d$-subspaces $\{L_i^*\}_{i=1}^K \subseteq G(D,d)$ from the "uniform" distribution on $G(D,d)$, $\gamma_{D,d}$. For $\varepsilon \ge 0$ and the sample $\{L_i^*\}_{i=1}^K$, let $\mu_\varepsilon$ be a weak HLM distribution with noise level $\varepsilon$ and let $X$ be an i.i.d. data set of size $N$ sampled from $\mu_\varepsilon$. If $p > 1$ and $K > 1$, then for almost every $\{L_i^*\}_{i=1}^K$ (w.r.t. $\gamma_{D,d}^K$) there exist positive constants $\delta_0$ and $\kappa_0$, independent of $N$, such that for any $\varepsilon < \delta_0$ the minimizer of (1), $(\hat L_1, \ldots, \hat L_K)$, satisfies w.o.p.:

(9)  $\operatorname{dist}_{G^K}((\hat L_1, \ldots, \hat L_K), (L_1^*, \ldots, L_K^*)) > \kappa_0.$

The above theorems have direct implications for HLM with spherically symmetric sampling along the subspaces. Theorems 1.1 and 1.2 clarify to some extent the robustness of two recent algorithms for HLM, which use the $l_1$ energy (1): Median $K$-Flats (MKF) [25] and Local Best-fit Flats (LBF) [27]. Theorem 1.3 explains why common HLM strategies that use the $l_2$ energy (1) (e.g., $K$-subspaces) are generally not robust to outliers.

1.6. Structure of the paper. Theorems 1.1, 1.2 and 1.3 are proved in Sections 2, 3 and 4, respectively. Section 5 discusses possible extensions as well as limitations of our theory and suggests some open directions.

2. Proof of Theorem 1.1. 

2.1. Preliminaries. We view the energy $e_{l_p}(X, L_1, \ldots, L_K)$ as a function defined on $G(D,d)^K$ while being conditioned on the fixed data set $X$. Therefore, the minimizer of $e_{l_p}(X, L_1, \ldots, L_K)$ is an element $(\hat L_1, \ldots, \hat L_K)$ in $G(D,d)^K$. Since any permutation of its $K$ coordinates in $G(D,d)$ results in another minimizer, we sometimes say that the set $\{\hat L_1, \ldots, \hat L_K\}$ is a minimizer [instead of $(\hat L_1, \ldots, \hat L_K)$].

We denote $e_{l_p}(x, L_1, \ldots, L_K) := e_{l_p}(\{x\}, L_1, \ldots, L_K)$ and view it as a function on $\mathbb{R}^D \times G(D,d)^K$.

We denote the set of all permutations of $(1, 2, \ldots, K)$ by $\mathcal{P}_K$. We designate an open ball in $G(D,d)$ by $B_G(L, r)$ as opposed to the Euclidean open ball in $\mathbb{R}^D$, $B(x, r)$.

We partition $X$ into the subsets $\{X_i\}_{i=0}^K$ with $\{N_i\}_{i=0}^K$ points sampled according to the distributions $\{\mu_i\}_{i=0}^K$.
We define

(10)  $\psi_{\mu_1}(t) = \mu_1(x \in \mathbb{R}^D : -t < |x^T v| < t),$

where $v$ is an arbitrarily fixed unit vector in $L_1^*$ [due to the spherical symmetry of $\mu_1$ within $L_1^*$, (10) is independent of $v$]. We note that since $\{\mu_i\}_{i=1}^K$ are generated by a single distribution, $\psi_{\mu_i}(t) = \psi_{\mu_1}(t)$ $\forall 2 \le i \le K$. The invertibility of $\psi_{\mu_1}$ is established in [12], Appendix A.2, and an estimate of $\psi_{\mu_1}$ for a uniform distribution on a $d$-dimensional ball appears in [12], Appendix A.1. Theorem 1.1 uses the constant $\tau_0$, which we can now define as follows:

(11)  $\tau_0 := \dfrac{(1 - \mu_1(\{0\})) \cdot 2^{p-1} \cdot \psi_{\mu_1}^{-1}\bigl((1 + (2K - 1)\mu_1(\{0\}))/(2K)\bigr)^p}{(\pi\sqrt{d})^p}.$

In the special case where $\mu_1$ is the uniform distribution on $B(0,1) \cap L_1^*$, the estimate of $\psi_{\mu_1}$ in [12], Section A.1, implies the following lower bound for $\tau_0$:

$\tau_0 \ge \dfrac{1}{2^{p+1} \cdot K^p \cdot d^{3p/2}}.$

Consequently, Theorem 1.1 holds in this case if $\tau_0$ in (6) is replaced by $1/(2^{p+1} \cdot K^p \cdot d^{3p/2})$. Furthermore, it follows from basic scaling arguments that if $\mu_1$ is the uniform distribution on $B(0, r_1) \cap L_1^*$ and $\operatorname{supp}(\mu_0) \subseteq B(0, r_2)$, where $r_1$ and $r_2$ are any positive numbers, then

$\tau_0 \ge \dfrac{r_1^p}{2^{p+1} \cdot K^p \cdot d^{3p/2} \cdot r_2^p}.$
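The lower bound for $\tau_0$ in the uniform case gives an explicit sufficient version of condition (6). The following sketch evaluates it under the assumption that the inliers are uniform on unit balls within the subspaces, so that $\tau_0 \ge 1/(2^{p+1} K^p d^{3p/2})$; the function names are hypothetical, and the resulting outlier fraction is only a sufficient bound, not a sharp threshold.

```python
def tau0_lower_bound(K, d, p):
    """Lower bound 1 / (2^(p+1) * K^p * d^(3p/2)) for tau_0 (uniform inliers on unit balls)."""
    return 1.0 / (2.0 ** (p + 1) * K ** p * d ** (1.5 * p))

def sufficient_outlier_fraction(K, d, p, alpha_min, min_dist):
    """An outlier fraction alpha_0 below this value satisfies (6) with tau_0
    replaced by its lower bound. min_dist is min_{i<j} dist_G(L_i*, L_j*)."""
    separation = min(1.0, (min_dist / 2.0) ** p)
    return tau0_lower_bound(K, d, p) * alpha_min * separation

bound = tau0_lower_bound(K=2, d=1, p=1.0)
alpha0_max = sufficient_outlier_fraction(K=2, d=1, p=1.0, alpha_min=0.4, min_dist=1.0)
```

The bound decays polynomially in $K$ and $d$, which indicates how conservative the tolerated outlier fraction becomes for many or high-dimensional subspaces.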
2.2. Auxiliary lemmata. The following lemmata are used throughout this proof (Lemma 2.1 is proved in the Appendix and Lemma 2.2 in [12], Appendix A.2).

Lemma 2.1. Suppose that $L_1^*, \hat L_1, \ldots, \hat L_K \in G(D,d)$, $p > 0$ and $\mu_1$ is a spherically symmetric distribution in $B(0,1) \cap L_1^*$. If $\min_{1 \le j \le K} \operatorname{dist}_G(L_1^*, \hat L_j) \ge \varepsilon$, then

$E_{\mu_1}(e_{l_p}(x, \hat L_1, \ldots, \hat L_K)) \ge \tau_0 \varepsilon^p.$

Lemma 2.2. For any $x \in \mathbb{R}^D$ and $L_1, L_2 \in G(D,d)$,

$|\operatorname{dist}(x, L_1) - \operatorname{dist}(x, L_2)| \le \|x\| \operatorname{dist}_G(L_1, L_2).$
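Lemma 2.2 is easy to probe numerically. The sketch below assumes that $\operatorname{dist}_G$ is the Euclidean norm of the vector of principal angles (the metric of Section 1.2) and computes the angles from the singular values of $A^T B$ for orthonormal bases $A$, $B$. The helper names, including `max_violation`, are illustrative only.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles between span(A) and span(B); A, B are D x d with orthonormal columns."""
    s = np.linalg.svd(A.T @ B, compute_uv=False)  # cosines of the principal angles
    return np.arccos(np.clip(s, -1.0, 1.0))

def dist_G(A, B):
    """Grassmannian distance: Euclidean norm of the vector of principal angles."""
    return float(np.linalg.norm(principal_angles(A, B)))

def dist_to_subspace(x, B):
    """Euclidean distance from the vector x to span(B)."""
    return float(np.linalg.norm(x - B @ (B.T @ x)))

def max_violation(trials=200, D=6, d=2, seed=0):
    """Largest value of |dist(x,L1) - dist(x,L2)| - ||x|| dist_G(L1,L2) over random draws."""
    rng = np.random.default_rng(seed)
    worst = -np.inf
    for _ in range(trials):
        A = np.linalg.qr(rng.standard_normal((D, d)))[0]
        B = np.linalg.qr(rng.standard_normal((D, d)))[0]
        x = rng.standard_normal(D)
        lhs = abs(dist_to_subspace(x, A) - dist_to_subspace(x, B))
        rhs = np.linalg.norm(x) * dist_G(A, B)
        worst = max(worst, lhs - rhs)
    return worst

worst = max_violation()
```

A non-positive `worst` (up to rounding) is consistent with the lemma; the inequality itself follows from $\|P_{L_1} - P_{L_2}\| = \sin\theta_{\max} \le \theta_{\max} \le \operatorname{dist}_G(L_1, L_2)$.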

2.3. Proof in expectation. We verify Theorem 1.1 "in expectation," whereas later sections extend the proof to hold w.o.p. We use the following notation w.r.t. the fixed $d$-subspaces $L_1^*, L_2^*, \ldots, L_K^*, \hat L_1, \hat L_2, \ldots, \hat L_K \in G(D,d)$:

(12)  $I(i) = \operatorname{argmin}_{1 \le j \le K} \operatorname{dist}_G(L_i^*, \hat L_j) \quad \forall 1 \le i \le K$

and

(13)  $d_0 = \min_{(i_1, \ldots, i_K) \in \mathcal{P}_K} \operatorname{dist}_{G^K}((L_{i_1}^*, \ldots, L_{i_K}^*), (\hat L_1, \ldots, \hat L_K)).$

The "expected version" of Theorem 1.1 is formulated and proved as follows.

Proposition 2.1. Suppose that $\hat L_1, \ldots, \hat L_K$ are arbitrary subspaces in $G(D,d)$, $0 < p \le 1$, and $I$ is defined w.r.t. $\{\hat L_i\}_{i=1}^K$ and the underlying subspaces $\{L_i^*\}_{i=1}^K$. If $(I(1), \ldots, I(K))$ is a permutation of $(1, \ldots, K)$, then

(14)  $E_\mu e_{l_p}(x, \hat L_1, \ldots, \hat L_K) - E_\mu e_{l_p}(x, L_1^*, \ldots, L_K^*) \ge \Bigl(\tau_0 \min_{1 \le j \le K} \alpha_j - \alpha_0\Bigr) d_0^p.$

On the other hand, if $(I(1), \ldots, I(K))$ is not a permutation of $(1, \ldots, K)$, then

(15)  $E_\mu e_{l_p}(x, \hat L_1, \ldots, \hat L_K) - E_\mu e_{l_p}(x, L_1^*, \ldots, L_K^*) \ge \tau_0 \Bigl(\min_{1 \le j \le K} \alpha_j\Bigr) \Bigl(\min_{1 \le i < j \le K} \operatorname{dist}_G(L_i^*, L_j^*)/2\Bigr)^p - \alpha_0.$

Proof. We define

$M = \operatorname{argmax}_{1 \le i \le K} \operatorname{dist}_G(L_i^*, \hat L_{I(i)}).$

Assume first that $(I(1), \ldots, I(K))$ is a permutation of $(1, \ldots, K)$. Using the definition of $I$, we have

(16)  $\min_{1 \le j \le K} \operatorname{dist}_G(L_M^*, \hat L_j) = \operatorname{dist}_G(L_M^*, \hat L_{I(M)}) = \operatorname{dist}_{G^K}((L_1^*, \ldots, L_K^*), (\hat L_{I(1)}, \ldots, \hat L_{I(K)})) = d_0.$

Combining (16) with Lemma 2.1, we obtain that

(17)  $E_{\mu_M} e_{l_p}(x, \hat L_1, \ldots, \hat L_K) - E_{\mu_M} e_{l_p}(x, L_1^*, \ldots, L_K^*) = E_{\mu_M} e_{l_p}(x, \hat L_1, \ldots, \hat L_K) \ge \tau_0 d_0^p.$

For any $x \in X_0$, let $m(x) = \operatorname{argmin}_{1 \le i \le K} \operatorname{dist}(x, L_i^*)$, $\hat m(x) = \operatorname{argmin}_{1 \le j \le K} \operatorname{dist}(x, \hat L_j)$ and note that

(18)  $e_{l_p}(x, \hat L_1, \ldots, \hat L_K) - e_{l_p}(x, L_1^*, \ldots, L_K^*)$
      $= \operatorname{dist}(x, \hat L_{\hat m(x)})^p - \operatorname{dist}(x, L_{m(x)}^*)^p$
      $\ge \operatorname{dist}(x, \hat L_{\hat m(x)})^p - \operatorname{dist}(x, L_{I^{-1}(\hat m(x))}^*)^p$
      $\ge -\|x\|^p \operatorname{dist}_G(\hat L_{\hat m(x)}, L_{I^{-1}(\hat m(x))}^*)^p$
      $\ge -\|x\|^p d_0^p \ge -d_0^p,$

where the second inequality in (18) uses Lemma 2.2. Therefore,

(19)  $E_{\mu_0} e_{l_p}(x, \hat L_1, \ldots, \hat L_K) - E_{\mu_0} e_{l_p}(x, L_1^*, \ldots, L_K^*) \ge -d_0^p.$

At last, we observe that

(20)  $E_\mu e_{l_p}(x, \hat L_1, \ldots, \hat L_K) - E_\mu e_{l_p}(x, L_1^*, \ldots, L_K^*)$
      $\ge \alpha_M \bigl(E_{\mu_M} e_{l_p}(x, \hat L_1, \ldots, \hat L_K) - E_{\mu_M} e_{l_p}(x, L_1^*, \ldots, L_K^*)\bigr)$
      $+ \alpha_0 \bigl(E_{\mu_0} e_{l_p}(x, \hat L_1, \ldots, \hat L_K) - E_{\mu_0} e_{l_p}(x, L_1^*, \ldots, L_K^*)\bigr).$

The proposition in this case thus follows from (17), (19) and (20).

Next, we assume that $(I(1), \ldots, I(K))$ is not a permutation of $(1, 2, \ldots, K)$. In this case, there exist $1 \le n_1 < n_2 \le K$ such that $I(n_1) = I(n_2)$ and, consequently,

(21)  $2 \min_{1 \le j \le K} \operatorname{dist}_G(L_M^*, \hat L_j) = 2 \operatorname{dist}_G(L_M^*, \hat L_{I(M)})$
      $\ge \operatorname{dist}_G(L_{n_1}^*, \hat L_{I(n_1)}) + \operatorname{dist}_G(L_{n_2}^*, \hat L_{I(n_2)})$
      $\ge \operatorname{dist}_G(L_{n_1}^*, L_{n_2}^*)$
      $\ge \min_{1 \le i < j \le K} \operatorname{dist}_G(L_i^*, L_j^*).$

Combining (21) and Lemma 2.1 [applied with $\varepsilon = \min_{1 \le i < j \le K} \operatorname{dist}_G(L_i^*, L_j^*)/2$], we obtain that

(22)  $E_{\mu_M} e_{l_p}(x, \hat L_1, \ldots, \hat L_K) - E_{\mu_M} e_{l_p}(x, L_1^*, \ldots, L_K^*) \ge \tau_0 \Bigl(\min_{1 \le i < j \le K} \operatorname{dist}_G(L_i^*, L_j^*)/2\Bigr)^p.$

Finally, since the support of $\mu_0$ is contained in $B(0,1)$, we note that

(23)  $E_{\mu_0} e_{l_p}(x, \hat L_1, \ldots, \hat L_K) - E_{\mu_0} e_{l_p}(x, L_1^*, \ldots, L_K^*) \ge -1.$

The proposition is thus concluded from (20), (22) and (23). □

2.4. Proof in a local ball by calculus on the Grassmannian. We cannot directly extend (14) to an estimate w.o.p., since its lower bound is a multiple of $d_0^p$, which approaches zero as the set $\{\hat L_i\}_{i=1}^K$ approaches $\{L_i^*\}_{i=1}^K$. We will need to exclude a ball in $G(D,d)^K$ around $\{L_i^*\}_{i=1}^K$ before such an extension. We thus prove here that $\{L_i^*\}_{i=1}^K$ is a unique global minimizer w.o.p. in a local ball. In Section 2.5 we extend Proposition 2.1 to an estimate w.o.p. outside this ball and conclude the theorem.

We show that there exists a sufficiently small number $\gamma_1$ such that $\{L_i^*\}_{i=1}^K$ is the unique global minimizer w.o.p. of $e_{l_p}$ in $B_G((L_1^*, \ldots, L_K^*), \gamma_1)$. Since $e_{l_p}$ is permutation invariant, it is also the unique global minimizer in

$\bigcup_{(i_1, i_2, \ldots, i_K) \in \mathcal{P}_K} B_G((L_{i_1}^*, \ldots, L_{i_K}^*), \gamma_1).$

In order to simplify notation in this part of the proof, we will adopt WLOG the convention that the RHS of (3) occurs at $i = 1$, that is,

(24)  $\operatorname{dist}_G(L_1^*, \hat L_1) = \max_{i=1,\ldots,K} (\operatorname{dist}_G(L_i^*, \hat L_i)).$

Following this convention and the fact that $e_{l_p}(\bigcup_{i=2}^K X_i, L_1^*, \ldots, L_K^*) = 0$, it is enough to prove that $(L_1^*, \ldots, L_K^*)$ is the unique global minimizer w.o.p. of $e_{l_p}(X_0 \cup X_1, \hat L_1, \ldots, \hat L_K)$ in $B_G((L_1^*, \ldots, L_K^*), \gamma_1)$, for sufficiently small $\gamma_1$.

Let $t_0 := \operatorname{dist}_G(L_1^*, \hat L_1)$. For each $1 \le i \le K$, we parametrize according to arc length the geodesic lines from $L_i^*$ to $\hat L_i$ by functions $L_i(t)$, $1 \le i \le K$, on the interval $[0, t_0]$ such that

(25)  $L_i(0) = L_i^*$ and $L_i(t_0) = \hat L_i.$

We will prove that for sufficiently small $\gamma_1 > 0$,

(26)  $\dfrac{d}{dt^p}\bigl(e_{l_p}(X_0 \cup X_1, L_1(t), \ldots, L_K(t))\bigr) > 0$ for all $0 < t \le \gamma_1$ w.o.p.

This will clearly imply our desired result.

Our proof of (26) is based on the following estimate:

(27)  $\dfrac{d}{dt^p}\bigl(e_{l_p}(x, L_1(t), \ldots, L_K(t))\bigr)\Big|_{t=0} \ge -\|x\|^p.$

In order to establish (27), we denote $j = \operatorname{argmin}_{1 \le i \le K} \operatorname{dist}(x, L_i^*)$ and apply Lemma 2.2 to obtain that

(28)  $\dfrac{d}{dt^p}\bigl(e_{l_p}(x, L_1(t), \ldots, L_K(t))\bigr)\Big|_{t=0} = \lim_{t \to 0} \dfrac{\operatorname{dist}(x, L_j(t))^p - \operatorname{dist}(x, L_j(0))^p}{t^p} \ge -\|x\|^p \lim_{t \to 0} \dfrac{\operatorname{dist}_G(L_j(t), L_j(0))^p}{t^p}.$

We also note that for all $0 < t \le t_0$,

(29)  $\dfrac{\operatorname{dist}_G(L_j(t), L_j(0))^p}{t^p} \le \dfrac{\operatorname{dist}_G(L_1(t), L_1(0))^p}{t^p} = 1.$

Indeed, if $t = t_0$, the inequality in (29) follows from (24) and the equality follows from (25). Moreover, both of them extend to $0 < t < t_0$ by the underlying property of arc length parametrization. Equation (27) thus follows from (28) and (29).

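Combining (27) with Hoeffding's inequality, we obtain that

(30)  $\dfrac{d}{dt^p}\bigl(e_{l_p}(X_0, L_1(t), \ldots, L_K(t))\bigr)\Big|_{t=0} \ge -\sum_{x \in X_0} \|x\|^p \ge -\alpha_0 N$ w.o.p.

We similarly derive an equation analogous to (30) when replacing $X_0$ with $X_1$ by applying some arguments of the proof of Lemma 2.1 and Hoeffding's inequality as follows:

(31)  $\dfrac{d}{dt^p}\bigl(e_{l_p}(X_1, L_1(t), \ldots, L_K(t))\bigr)\Big|_{t=0} = \dfrac{d}{dt^p}\bigl(e_{l_p}(X_1, L_1(t))\bigr)\Big|_{t=0} \ge \tau_0 \alpha_1 N$ w.o.p.

At last, combining (30), (31) and (6), we obtain that there exists $\gamma_1' = \gamma_1'(D, d, K, p, \alpha_0, \alpha_1)$ such that w.o.p.

$\dfrac{d}{dt^p}\bigl(e_{l_p}(X_0 \cup X_1, L_1(t), \ldots, L_K(t))\bigr)\Big|_{t=0} \ge (\tau_0 \alpha_1 - \alpha_0) N > \gamma_1' N.$

Using the arguments of the proof of [12], equation (35), we conclude that there exists a constant $\gamma_1 = \gamma_1(D, d, K, p, \alpha_0, \alpha_1, \min_{2 \le i \le K} \operatorname{dist}(L_1^*, L_i^*), \mu_0, \mu_1) > 0$ such that (26) holds.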

2.5. Conclusion of Theorem 1.1. In order to conclude the theorem, it is enough to prove that $\{L_1^*, \ldots, L_K^*\}$ is the unique global minimizer w.o.p. of $e_{l_p}(X_0 \cup X_1, \hat L_1, \ldots, \hat L_K)$ in the set

(32)  $GP(D, d, \gamma_1) := G(D,d)^K \setminus \bigcup_{(i_1, i_2, \ldots, i_K) \in \mathcal{P}_K} B_G((L_{i_1}^*, \ldots, L_{i_K}^*), \gamma_1).$

Combining Proposition 2.1, the fact that $d_0 \ge \gamma_1$ [which follows from the definition of $d_0$ in (13)], Hoeffding's inequality and (6), we obtain that there exists $\gamma_2 = \gamma_2(D, d, K, p, \alpha_0, \min_{1 \le i \le K} \alpha_i, \min_{1 \le i < j \le K} \operatorname{dist}(L_i^*, L_j^*), \mu_0, \mu_1) > 0$ such that for any fixed $(\hat L_1, \ldots, \hat L_K) \in GP(D, d, \gamma_1)$,

(33)  $e_{l_p}(X, \hat L_1, \ldots, \hat L_K) - e_{l_p}(X, L_1^*, \ldots, L_K^*) > \gamma_2 N$ w.o.p.

Following the proof of [12], Theorem 1.1 [i.e., covering $GP(D, d, \gamma_1)$ by balls], we easily extend (33) w.o.p. for all $K$ subspaces in the set $GP(D, d, \gamma_1)$ (instead of fixed ones) and thus conclude the theorem.

3. Proof of Theorem 1.2 and a counterexample to asymptotic recovery. 

3.1. Proof of Theorem 1.2. Following the argument of [12], Section 3.5.1, we reduce the verification of Theorem 1.2 to proving that there exists a constant $\gamma_3 > 0$ such that if for all permutations $(i_1, \ldots, i_K) \in \mathcal{P}_K$, $\hat L_1, \ldots, \hat L_K \in G(D,d)$ satisfy that $\operatorname{dist}_{G^K}((L_{i_1}^*, \ldots, L_{i_K}^*), (\hat L_1, \ldots, \hat L_K)) > f$, then

(34)  $E_\mu(e_{l_p}(x, \hat L_1, \ldots, \hat L_K)) > E_\mu(e_{l_p}(x, L_1^*, \ldots, L_K^*)) + \gamma_3 + 2\varepsilon^p.$

In view of Proposition 2.1, in order to conclude (34), it is sufficient to verify that

(35)  $\Bigl(\tau_0 \min_{1 \le j \le K} \alpha_j - \alpha_0\Bigr) f^p > \gamma_3 + 2\varepsilon^p$

and

(36)  $\tau_0 \min_{1 \le j \le K} \alpha_j \min_{1 \le i < j \le K} \operatorname{dist}_G(L_i^*, L_j^*)^p / 2^p - \alpha_0 > \gamma_3 + 2\varepsilon^p.$

Setting $\gamma_3 = \varepsilon^p/2$, (35) follows from (8) and (36) follows from (7).

3.1.1. Remark on the size of $\varepsilon$. If

(37)  $\varepsilon \ge \pi\sqrt{d}\, 3^{-1/p} \Bigl(\tau_0 \min_{1 \le j \le K} \alpha_j - \alpha_0\Bigr)^{1/p} / 2,$

then $f \ge \pi\sqrt{d}/2$, so that there is no restriction on the minimizer of (1) in $G(D,d)^K$. It thus makes sense to further restrict $\varepsilon$ to be at least lower than the right-hand side of (37).

3.2. A counterexample to exact asymptotic recovery with noise. One may ask if it is possible in the noisy setting ($\varepsilon > 0$) to recover the underlying subspaces as the number of sampled points, $N$, approaches infinity. The answer to this question is positive when $K = 1$ (see, e.g., [2], Section 11.6, [18]) or $d = 0$ (see [17]). However, it is often negative when $d \ge 1$ and $K > 1$, as we demonstrate in Figure 1(a) and explain below. In this example, $D = 2$, $K = 2$, $d = 1$, $\alpha_0 = 0$ and the two underlying distributions $\mu_1$ and $\mu_2$ (corresponding to the two underlying lines $L_1^*$ and $L_2^*$) are uniformly distributed in the two gray regions demonstrated in this figure (the region around $L_1^*$ is a rectangle and the region around $L_2^*$ is a union of two disjoint rectangles).

Fig. 1. A counterexample showing that exact recovery with noise is impossible even asymptotically. (a) Gray regions of uniform distributions around the two underlying lines. (b) The gray region is the intersection of $Y_1$ with the uniform distribution region around $L_1^*$. The best $l_p$ line in $Y_1$ is $\hat L_1$.

In order to verify that this is indeed a counterexample, we use a Voronoi-type region, which allows us to reduce approximation by multiple subspaces to approximation by a single subspace on it. Such regions $\{Y_i\}_{i=1}^K$, which are frequently used in Section 4, are obtained by a Voronoi diagram (restricted to the unit ball) of given $d$-subspaces $\{L_i\}_{i=1}^K \subseteq G(D,d)$ as follows:

(38)  $Y_i(L_1, \ldots, L_K) = \{x \in B(0,1) : \operatorname{dist}(x, L_i) < \operatorname{dist}(x, L_j) \ \forall j : 1 \le j \ne i \le K\}.$
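The Voronoi-type regions of (38) amount to a nearest-subspace assignment restricted to the open unit ball. A minimal sketch follows, assuming subspaces are given by orthonormal bases; the function name `voronoi_labels` and the convention of returning -1 for points that belong to no region (ties, or points outside the ball) are choices made for this example.

```python
import numpy as np

def voronoi_labels(X, bases):
    """Index i such that x lies in Y_i(L_1, ..., L_K) of (38), or -1 if x is in no region.

    X is N x D; each element of bases is a D x d matrix with orthonormal columns.
    """
    dists = np.stack(
        [np.linalg.norm(X - (X @ B) @ B.T, axis=1) for B in bases], axis=1
    )  # N x K distances to the subspaces
    labels = dists.argmin(axis=1)
    if dists.shape[1] > 1:
        ordered = np.sort(dists, axis=1)
        strict = ordered[:, 1] > ordered[:, 0]   # strict inequality in (38)
    else:
        strict = np.ones(len(X), dtype=bool)
    inside = np.linalg.norm(X, axis=1) < 1.0     # open unit ball B(0, 1)
    labels[~(strict & inside)] = -1
    return labels

# The two coordinate axes in R^2.
e1 = np.array([[1.0], [0.0]])
e2 = np.array([[0.0], [1.0]])
X = np.array([[0.5, 0.1], [0.1, -0.5], [0.3, 0.3], [2.0, 0.0]])
labels = voronoi_labels(X, [e1, e2])
```

Here the first point falls in $Y_1$, the second in $Y_2$, the third is equidistant from both lines and the fourth lies outside the unit ball, so the last two are assigned to no region.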

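These regions are useful to us due to the following elementary proposition, whose trivial proof is described in the Appendix.

Proposition 3.1. If $\hat L_1, \ldots, \hat L_K \in G(D,d)$, $\nu$ is a probability measure on $\mathbb{R}^D$ and

$(\hat L_1, \ldots, \hat L_K) = \operatorname{argmin}_{(L_1, \ldots, L_K) \in G(D,d)^K} E_\nu(e_{l_p}(x, L_1, \ldots, L_K)),$

then

(39)  $\hat L_i = \operatorname{argmin}_{L \in G(D,d)} E_\nu\bigl(e_{l_p}(x, L)\, I(x \in Y_i(\hat L_1, \hat L_2, \ldots, \hat L_K))\bigr).$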

LieG(D,d) 

We claim that for any fixed $p > 0$, the distance between $\{L_1^*, L_2^*\}$ and the global minimizer of (1) in the setting of this example is bounded from below w.o.p. by a positive constant independent of the sample size, $N$, for sufficiently large $N$. Equivalently, we claim that the distance between $\{L_1^*, L_2^*\}$ and the global minimizer of $E_{\mu_\varepsilon}(\operatorname{dist}^p(x, \bigcup_{i=1}^2 L_i))$ is positive, where $\mu_\varepsilon$ is the underlying mixture distribution for this example. In view of Proposition 3.1, we only need to show a positive distance between $L_1^*$ and the minimizer of $E_{\mu_\varepsilon}(e_{l_p}(x, L)\, I(x \in Y_1))$, where $Y_1 = Y_1(L_1^*, L_2^*)$. We refer to this minimizer as the best $l_p$ line for $Y_1$ and denote it by $\hat L_1$ (while arbitrarily fixing $p$). We note that for any $p > 0$, the integral of $l_p$ distances of points in the part of $Y_1$ above $L_1^*$ from the line $L_1^*$ is smaller than the similar integral in the bottom part. Therefore, $\hat L_1$ is different from $L_1^*$, and the respective orientation of the two lines is demonstrated in Figure 1(b). The claim is thus concluded.

4. Proof of Theorem 1.3. 

4.1. Preliminaries. 

4.1.1. Notation. We designate the projection from $\mathbb{R}^D$ onto its subspace $L$ by $P_L$ and the corresponding projection onto the orthogonal complement of $L$ by $P_L^\perp$. We define

(40)  $D_{L,x,p} = P_L(x)\, P_L^\perp(x)^T \operatorname{dist}(x, L)^{(p-2)}.$

We frequently use the Voronoi-type regions $\{Y_i\}_{i=1}^K$ defined in (38) with respect to the subspaces $\{L_i^*\}_{i=1}^K$ and possibly two additional arbitrary subspaces denoted by $\hat L_2 \in G(D,d)$ and $\tilde L_2 \in G(D,d)$. We will use the following short notation for $1 \le i \le K$:

(41)  $\hat Y_i = Y_i(L_1^*, \hat L_2, L_3^*, \ldots, L_K^*), \quad \tilde Y_i = Y_i(L_1^*, \tilde L_2, L_3^*, \ldots, L_K^*)$

and

(42)  $Y_i = Y_i(L_1^*, L_2^*, L_3^*, \ldots, L_K^*).$

We denote by $\bar Y_i$ the closure of $Y_i$, that is,

(43)  $\bar Y_i = \{x \in B(0,1) : \operatorname{dist}(x, L_i^*) \le \operatorname{dist}(x, L_j^*) \ \forall j : 1 \le j \ne i \le K\}.$

Similarly, the closure of $\hat Y_i$ is denoted by $\bar{\hat Y}_i$.

Let $\mathcal{L}_k$ denote the $k$-dimensional Lebesgue measure. We denote $d_* = d \wedge (D - d)$ and let $\theta_{d_*}(L_i^*, L_j^*)$ be the $d_*$th largest principal angle between the $d$-subspaces $L_i^*$ and $L_j^*$. Our analysis uses the distribution $\mu = \alpha_0 \mu_0 + \sum_{i=1}^K \alpha_i \mu_i$, even though the underlying distribution of our model is $\mu_\varepsilon$. For $\hat L, L^* \in G(D,d)$, we define the "orthogonal subtraction" as follows:

$L^* \ominus \hat L = L^* \cap (\hat L \cap L^*)^\perp.$

4.1.2. Auxiliary lemmata. Using the notation above, we formulate two 
lemmata, which will be used throughout this proof. The proof of Lemma 4.1 
is identical to that of [12], Proposition 2.2 (while replacing sums by expec- 
tations), whereas Lemma 4.2 is proved in the Appendix. 

Lemma 4.1. For any $L^* \in G(D,d)$ and distribution $\mu$, a necessary condition for $L^*$ to be a local minimum of $E_\mu(e_{l_p}(x, L))$ is

(44)  $E_\mu(D_{L^*,x,p}) = 0.$
The next lemma quantifies the sensitivity of the region $Y_j$, where $1 \le j \le K$, to perturbations in the subspace $L_i$, where $1 \le i \ne j \le K$. WLOG we formulate it with $j = 1$ and $i = 2$ [note that we use the short notation of (41)].

Lemma 4.2. If $\hat L_2, \tilde L_2, L_1^*, L_3^*, \ldots, L_K^*$ are subspaces in $G(D,d)$ such that

(45) min(^(L 2 ,L*)) > 0, min (^(L*,L*)) > 
and 

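(46)  $\theta_{d_*}(\hat L_2, L_1^*) \vee \theta_{d_*}(\tilde L_2, L_1^*) < \min_{3 \le i \le K} \theta_{d_*}(L_i^*, L_1^*),$

then

(47)  $\mathcal{L}_D((\hat Y_1 \setminus \tilde Y_1) \cup (\tilde Y_1 \setminus \hat Y_1)) > 0.$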

4.2. A special case. The proof of Theorem 1.3 is rather involved. In order to develop a simple intuition, we provide an elementary proof of the very special case where $d = 1$, $p = 2$ and $K = 2$. For simplicity we also assume that $D = 2$, though our argument easily extends to $D > 2$. Figure 2 shows the two underlying lines $L_1^*$ and $L_2^*$ and their corresponding regions $Y_1$ and $Y_2$. We note that the best $l_2$ lines [in $G(D,1)$] for $\mu_0$ restricted to $Y_1$ and $Y_2$ are the central axes of those regions. Since $\alpha_0 > 0$, the best $l_2$ lines [in $G(D,1)$] for $\mu$ restricted to $Y_1$ and $Y_2$ (denoted by $\hat L_1$ and $\hat L_2$, resp.) must reside between the best $l_2$ lines for $\mu_0$ restricted to $Y_1$ and $Y_2$ and $L_1^*$ and $L_2^*$, respectively. In particular, they are different from $L_1^*$ and $L_2^*$ as demonstrated in the figure. Therefore, $E_\mu(e_{l_2}(x, L_1^*, L_2^*)) > E_\mu(e_{l_2}(x, \hat L_1, \hat L_2))$. This implies that w.o.p. $e_{l_2}(X, L_1^*, L_2^*) > e_{l_2}(X, \hat L_1, \hat L_2)$.

Fig. 2. Illustrative proof of Theorem 1.3 in the special case where $p = 2$, $d = 1$, $D = 2$ and $K = 2$.

4.3. Reduction of the statement of Theorem 1.3 to simpler formulations. 

4.3.1. Reduction I: Using the Voronoi-type regions $\{Y_i\}_{i=1}^K$. We will show here that the following equation implies Theorem 1.3:

(48)  $\gamma_{D,d}^K\bigl(\{L_i^*\}_{i=1}^K \subseteq G(D,d) : E_{\mu_0}(I(x \in Y_j) D_{L_j^*,x,p}) = 0 \ \forall 1 \le j \le K\bigr) = 0.$

First, we apply the argument of [12], Section 3.6.1 (which requires the assumption specified in Section 1.3 that the first moments of $\{\nu_{i,\varepsilon}\}_{i=1}^K$ are smaller than $\varepsilon$) to obtain that Theorem 1.3 follows by the equation

(49)  $\gamma_{D,d}^K\Bigl(\{L_i^*\}_{i=1}^K \subseteq G(D,d) : (L_1^*, \ldots, L_K^*) = \operatorname{argmin}_{(L_1, \ldots, L_K)} E_\mu(e_{l_p}(x, L_1, \ldots, L_K))\Bigr) = 0.$

Next, applying Proposition 3.1, we conclude that (49) is a direct consequence of the equation:

(50)  $\gamma_{D,d}^K\Bigl(\{L_i^*\}_{i=1}^K \subseteq G(D,d) : L_j^* = \operatorname{argmin}_{L \in G(D,d)} E_\mu(e_{l_p}(x, L)\, I(x \in Y_j)) \ \forall 1 \le j \le K\Bigr) = 0.$

Furthermore, applying Lemma 4.1 with $\mu = \mu|_{Y_j}$, we obtain that (50) follows by the equation

(51)  $\gamma_{D,d}^K\bigl(\{L_i^*\}_{i=1}^K \subseteq G(D,d) : E_\mu(I(x \in Y_j) D_{L_j^*,x,p}) = 0 \ \forall 1 \le j \le K\bigr) = 0.$

At last we conclude the desired reduction by noting that (51) and (48) are equivalent [indeed, the only relevant components of the distribution $\mu$ in (51) are $\mu_0$ and $\mu_j$ and the corresponding expectation according to $\mu_j$ is zero].

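4.3.2. Reduction II: From $K$ subspaces to a single subspace. We reduce (48) so that its underlying condition involves a single subspace as follows:

(52)  $\gamma_{D,d}\Bigl(L_2^* \in G(D,d) : \min_{1 \le i \ne j \le K} \theta_{d_*}(L_i^*, L_j^*) > 0,\ \operatorname{argmin}_{2 \le i \le K} \theta_{d_*}(L_1^*, L_i^*) = 2,\ E_{\mu_0}(I(x \in Y_1) D_{L_1^*,x,p}) = 0\Bigr) = 0.$

We remark that some of the underlying technical conditions of (52) appear in (45) and (46) and will be better understood later when applying Lemma 4.2.

We verify this reduction as follows. WLOG (52) can be formulated by replacing $L_2^*$ with $L_k^*$, for some $3 \le k \le K$, while letting $\operatorname{argmin}_{2 \le i \le K} \theta_{d_*}(L_1^*, L_i^*) = k$. Combining this observation with elementary properties of distributions, we have that

$\gamma_{D,d}^K\bigl(\{L_i^*\}_{i=1}^K \subseteq G(D,d) : E_{\mu_0}(I(x \in Y_j) D_{L_j^*,x,p}) = 0 \ \forall 1 \le j \le K\bigr)$
$\le \sum_{k=2}^K \int \gamma_{D,d}\Bigl(L_k^* : \min_{1 \le i \ne j \le K} \theta_{d_*}(L_i^*, L_j^*) > 0,\ \operatorname{argmin}_{2 \le i \le K} \theta_{d_*}(L_1^*, L_i^*) = k,\ E_{\mu_0}(I(x \in Y_1) D_{L_1^*,x,p}) = 0 \,\Big|\, \{L_i^*\}_{1 \le i \ne k \le K}\Bigr)\, d\gamma_{D,d}^{K-1}(\{L_i^*\}_{1 \le i \ne k \le K})$
$+ \gamma_{D,d}^K\Bigl(\{L_i^*\}_{i=1}^K \subseteq G(D,d) : \min_{1 \le i \ne j \le K} \theta_{d_*}(L_i^*, L_j^*) = 0\Bigr) = 0.$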

4.4. Concluding the cases d = 1 and d = D — 1. We assume first that 
d = 1. We conclude the theorem in this case by proving (52) and then extend 
the analysis to the case d = D — 1. 

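4.4.1. Reduction of (52) using additional condition on the Grassmannian. We fix $v_1$ to be one of the two unit vectors spanning $L_1^*$ and denote by $u_1$ the unit vector spanning $(L_1^* + L_2^*) \cap L_1^{*\perp}$ having orientation such that for any point $x \in L_2^*$: $(x^T u_1)(x^T v_1) \ge 0$. We will prove that (52) follows from the following equation, which introduces a restriction on the Grassmannian:

(53)  $\gamma_{D,d}\Bigl(L_2^* \in G(D,d) : \min_{1 \le i \ne j \le K} \theta_{d_*}(L_i^*, L_j^*) > 0,\ \operatorname{argmin}_{2 \le i \le K} \theta_{d_*}(L_1^*, L_i^*) = 2,\ E_{\mu_0}(I(x \in Y_1) D_{L_1^*,x,p}) = 0 \,\Big|\, (L_1^* + L_2^*) \cap L_1^{*\perp} = \operatorname{Sp}(u_1)\Bigr) = 0.$

We define the following subset of the sphere $S^{D-1}$: $\Omega_0 = \{x \in S^{D-1} : x \perp v_1\}$, and a distribution $\omega$ on $\Omega_0$ such that for any $A \subseteq \Omega_0$: $\omega(A) = \gamma_{D,d}(L_2^* \in G(D,d) : (L_1^* + L_2^*) \cap L_1^{*\perp} \in \operatorname{Sp}(A))$. Using this notation, (53) implies (52) as follows:

$\gamma_{D,d}\Bigl(L_2^* \in G(D,d) : \min_{1 \le i \ne j \le K} \theta_{d_*}(L_i^*, L_j^*) > 0,\ \operatorname{argmin}_{2 \le i \le K} \theta_{d_*}(L_1^*, L_i^*) = 2,\ E_{\mu_0}(I(x \in Y_1) D_{L_1^*,x,p}) = 0\Bigr)$
$= \int_{\Omega_0} \gamma_{D,d}\Bigl(L_2^* : \min_{1 \le i \ne j \le K} \theta_{d_*}(L_i^*, L_j^*) > 0,\ \operatorname{argmin}_{2 \le i \le K} \theta_{d_*}(L_1^*, L_i^*) = 2,\ E_{\mu_0}(I(x \in Y_1) D_{L_1^*,x,p}) = 0 \,\Big|\, (L_1^* + L_2^*) \cap L_1^{*\perp} = \operatorname{Sp}(u_1)\Bigr)\, d\omega(u_1)$
$= 0.$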
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4.4.2. Proof of (53). We will show that at most one element satisfies the 
underlying condition of (53) (i.e., it is a member of the set for which jd d 1S 
evaluated). Assume, on the contrary, that there are two subspaces L2 and L2 
satisfying this condition with corresponding angles 9 = #d*(L*,L2) and 9 = 
d *(L|,L 2 ) in [0,7r/2], where WLOG 9 > 9. Using the notation of (41), we 
have that 

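\[
\begin{aligned}
&E_{\mu_0}\bigl(I(\mathbf{x}\in Y_1\setminus\hat Y_1)\,D_{L_1^*,\mathbf{x},p}\bigr) - E_{\mu_0}\bigl(I(\mathbf{x}\in \hat Y_1\setminus Y_1)\,D_{L_1^*,\mathbf{x},p}\bigr)\\
&\quad= E_{\mu_0}\bigl(I(\mathbf{x}\in Y_1)\,D_{L_1^*,\mathbf{x},p}\bigr) - E_{\mu_0}\bigl(I(\mathbf{x}\in \hat Y_1)\,D_{L_1^*,\mathbf{x},p}\bigr)\\
&\quad= \mathbf{0} - \mathbf{0} = \mathbf{0}.
\end{aligned} \tag{54}
\]

Consequently,

\[
E_{\mu_0}\bigl(I(\mathbf{x}\in Y_1\setminus\hat Y_1)\,\mathbf{v}_1^T D_{L_1^*,\mathbf{x},p}\,\mathbf{u}_1\bigr) - E_{\mu_0}\bigl(I(\mathbf{x}\in \hat Y_1\setminus Y_1)\,\mathbf{v}_1^T D_{L_1^*,\mathbf{x},p}\,\mathbf{u}_1\bigr) = 0. \tag{55}
\]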
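Defining

\[
\theta_{\mathbf{u}_1,\mathbf{v}_1}(\mathbf{x}) = \arctan\Bigl(\frac{\mathbf{u}_1\cdot\mathbf{x}}{\mathbf{v}_1\cdot\mathbf{x}}\Bigr)
\]

and

\[
Y_{1,2} = \Bigl\{\mathbf{x}\in B(\mathbf{0},1) : \operatorname{dist}(\mathbf{x},L_1^*) < \min_{3\le i\le K}\operatorname{dist}(\mathbf{x},L_i^*)\Bigr\},
\]

we express the regions $Y_1$ and $\hat Y_1$ as follows:

\[
Y_1 = Y_{1,2}\cap\{\mathbf{x}\in B(\mathbf{0},1) : \theta/2 - \pi/2 < \theta_{\mathbf{u}_1,\mathbf{v}_1}(\mathbf{x}) < \theta/2\}, \tag{56}
\]

\[
\hat Y_1 = Y_{1,2}\cap\{\mathbf{x}\in B(\mathbf{0},1) : \hat\theta/2 - \pi/2 < \theta_{\mathbf{u}_1,\mathbf{v}_1}(\mathbf{x}) < \hat\theta/2\}. \tag{57}
\]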
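Figure 3 clarifies (56) and (57) in the special case where $d = 1$ and $K = 2$.

Fig. 3. The regions $Y_1$ and $\hat Y_1$ and the relation to $\theta$ and $\hat\theta$ when $d = 1$ and $K = 2$ (the two labeled regions are $Y_1\setminus\hat Y_1$ and $\hat Y_1\setminus Y_1$).

Combining (56) and (57) with the definition of $D_{L,\mathbf{x},p}$ in (40), we obtain
that

\[
Y_1\setminus\hat Y_1 \subset \{\mathbf{x}\in B(\mathbf{0},1) : \mathbf{v}_1^T\mathbf{x}\mathbf{x}^T\mathbf{u}_1 = \operatorname{dist}(\mathbf{x},L_1^*)^{(2-p)}\,\mathbf{v}_1^T D_{L_1^*,\mathbf{x},p}\,\mathbf{u}_1 > 0\} \tag{58}
\]

and

\[
\hat Y_1\setminus Y_1 \subset \{\mathbf{x}\in B(\mathbf{0},1) : \mathbf{v}_1^T\mathbf{x}\mathbf{x}^T\mathbf{u}_1 = \operatorname{dist}(\mathbf{x},L_1^*)^{(2-p)}\,\mathbf{v}_1^T D_{L_1^*,\mathbf{x},p}\,\mathbf{u}_1 < 0\}. \tag{59}
\]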

It follows from Lemma 4.2 that $\mathcal{L}_D((Y_1\setminus\hat Y_1)\cup(\hat Y_1\setminus Y_1)) > 0$ and, consequently, for any $r > 0$, $\mathcal{L}_D(B(\mathbf{0},r)\cap((Y_1\setminus\hat Y_1)\cup(\hat Y_1\setminus Y_1))) > 0$ (indeed,
if $\mathbf{x}\in Y_1$, then $c\cdot\mathbf{x}\in Y_1$ for any $0 < c < 1/\|\mathbf{x}\|$; thus, the distribution in
the latter inequality is just a scaling by $r^D$ of the distribution in the former
one). Since there exists $r > 0$ such that the restriction of $\mathcal{L}_D$ to $B(\mathbf{0},r)$
is absolutely continuous with respect to $\mu_0$, we also have that $\mu_0(B(\mathbf{0},r)\cap((Y_1\setminus\hat Y_1)\cup(\hat Y_1\setminus Y_1))) > 0$. However, this contradicts (55), (58) and (59),
that is, it proves (53) and therefore the theorem in the current special
case.
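
The sign pattern in (58) and (59) can be checked numerically in the planar setting of Figure 3. The following Python sketch is only an illustration under stated assumptions: it takes $D = 2$, $d = 1$, $K = 2$, identifies $L_1^*$ with the x-axis (so that $\mathbf{v}_1 = \mathbf{e}_1$ and $\mathbf{u}_1 = \mathbf{e}_2$), and places the two candidate lines at hypothetical angles $\theta = \pi/3$ and $\hat\theta = \pi/6$. The function names are ours and do not appear in the paper.

```python
import numpy as np

def dist_to_line(points, angle):
    """Distance from each row of `points` to the line through 0 at `angle`."""
    normal = np.array([-np.sin(angle), np.cos(angle)])
    return np.abs(points @ normal)

def sign_check(theta, theta_hat, n_angles=720, n_radii=5):
    """Extreme values of (v1.x)(u1.x) on Y1 minus Yhat1 and on Yhat1 minus Y1."""
    # Polar grid inside the unit disk; the half-step offset avoids region boundaries.
    phis = (np.arange(n_angles) + 0.5) * 2 * np.pi / n_angles
    radii = (np.arange(n_radii) + 1.0) / (n_radii + 1.0)
    pts = np.array([[r * np.cos(a), r * np.sin(a)] for r in radii for a in phis])
    d1 = dist_to_line(pts, 0.0)                      # distance to L1* (the x-axis)
    in_y1 = d1 < dist_to_line(pts, theta)            # closer to L1* than to L2*
    in_y1_hat = d1 < dist_to_line(pts, theta_hat)    # closer to L1* than to hat L2
    prod = pts[:, 0] * pts[:, 1]                     # (v1 . x)(u1 . x)
    return prod[in_y1 & ~in_y1_hat].min(), prod[~in_y1 & in_y1_hat].max()

pos_min, neg_max = sign_check(theta=np.pi / 3, theta_hat=np.pi / 6)
```

With $\theta > \hat\theta$, `pos_min` is positive and `neg_max` is negative, in agreement with (58) and (59) for this configuration.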

4.4.3. The case $d = D - 1$. We note that the proof of the above case
($d = 1$) can be adapted to the case where $d = D - 1$. This is done by letting $\mathbf{v}_1$
be one of the two unit vectors spanning $L_1^*\cap(L_1^*\cap L_2^*)^{\perp}$ [note that $\dim(L_1^*) = D - 1$ and $\dim(L_1^*\cap L_2^*) = D - 2$ so that $\dim(L_1^*\cap(L_1^*\cap L_2^*)^{\perp}) = 1$] and $\mathbf{u}_1$ be
the unit vector of $(L_1^* + L_2^*)\cap L_1^{*\perp}$ with a similar orientation as in the case
where $d = 1$.

4.5. Conclusion: The case where $d \ne 1$ and $d \ne D - 1$.

4.5.1. Reduction of (52) using additional condition on the Grassmannian. 
The following reduction is analogous to the one of Section 4.4.1. Denoting 
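by $B(\mathbb{R}^D,\mathbb{R}^D)$ the space of linear operators from $\mathbb{R}^D$ to itself, we define

\[
\begin{aligned}
\Omega_1 = \bigl\{(P_1,P_2)\in B(\mathbb{R}^D,\mathbb{R}^D)^2 :\ &\exists\, L\in G(D,d) \text{ not orthogonal to } L_1^*,\\
&\text{s.t. } \dim(L_1^*\ominus L) > 1,\ P_{L_1^*}^T P_L P_{L_1^*} = P_1,\ P_{L_1^*}^{\perp T} P_L P_{L_1^*}^{\perp} = P_2\bigr\}
\end{aligned}
\]

and the distribution $\omega_1$ on $\Omega_1$ as follows: for any set $A \subseteq \Omega_1$,

\[
\omega_1(A) = \gamma_{D,d}\bigl(L\in G(D,d) : (P_{L_1^*}^T P_L P_{L_1^*},\, P_{L_1^*}^{\perp T} P_L P_{L_1^*}^{\perp})\in A\bigr).
\]

Using this notation, we reduce (52) as follows: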

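\[
\begin{aligned}
\gamma_{D,d}\Bigl(&L_2^*\in G(D,d) : L_1^*\not\perp L_2^*,\ \dim(L_1^*\ominus L_2^*) > 1,\\
&\min_{1\le i\ne j\le K}\theta_{d^*}(L_i^*,L_j^*) > 0,\ \operatorname*{argmin}_{2\le i\le K}\theta_{d^*}(L_1^*,L_i^*) = 2,\\
&E_{\mu_0}\bigl(I(\mathbf{x}\in Y_1)\,D_{L_1^*,\mathbf{x},p}\bigr) = \mathbf{0} \Bigm|\\
&(P_{L_1^*}^T P_{L_2^*} P_{L_1^*},\, P_{L_1^*}^{\perp T} P_{L_2^*} P_{L_1^*}^{\perp}) = (P_1,P_2)\in\Omega_1\Bigr) = 0.
\end{aligned} \tag{60}
\]

Indeed,

\[
\begin{aligned}
&\gamma_{D,d}\Bigl(L_2^*\in G(D,d) : \min_{1\le i\ne j\le K}\theta_{d^*}(L_i^*,L_j^*) > 0,\ \operatorname*{argmin}_{2\le i\le K}\theta_{d^*}(L_1^*,L_i^*) = 2,\ E_{\mu_0}\bigl(I(\mathbf{x}\in Y_1)\,D_{L_1^*,\mathbf{x},p}\bigr) = \mathbf{0}\Bigr)\\
&\quad\le \int_{\Omega_1}\gamma_{D,d}\Bigl(L_2^* \text{ is not orthogonal to } L_1^*,\ \dim(L_1^*\ominus L_2^*) > 1,\ \min_{1\le i\ne j\le K}\theta_{d^*}(L_i^*,L_j^*) > 0,\\
&\qquad\qquad \operatorname*{argmin}_{2\le i\le K}\theta_{d^*}(L_1^*,L_i^*) = 2,\ E_{\mu_0}\bigl(I(\mathbf{x}\in Y_1)\,D_{L_1^*,\mathbf{x},p}\bigr) = \mathbf{0} \Bigm|\\
&\qquad\qquad (P_{L_1^*}^T P_{L_2^*} P_{L_1^*},\, P_{L_1^*}^{\perp T} P_{L_2^*} P_{L_1^*}^{\perp}) = (P_1,P_2)\in\Omega_1\Bigr)\, d\omega_1(P_1,P_2)\\
&\qquad + \gamma_{D,d}\bigl(L_2^*\in G(D,d) : \dim(L_1^*\ominus L_2^*) \le 1, \text{ or } L_2^*\perp L_1^*\bigr) = 0 + 0 = 0.
\end{aligned}
\]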

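4.5.2. Bulk of the proof. We prove (60) by using the following two lemmata,
which are proved below (Sections 4.5.3 and 4.5.4).

Lemma 4.3. If $\dim(L_1^*\ominus L_2^*) \ge 2$ and $L_1^*$ is not orthogonal to $L_2^*$, then
the set

\[
Z = \{L\in G(D,d) : P_{L_1^*}(P_{L_2^*} - P_L)P_{L_1^*} = \mathbf{0},\ P_{L_1^*}^{\perp}(P_{L_2^*} - P_L)P_{L_1^*}^{\perp} = \mathbf{0}\}
\]

is infinite.

Lemma 4.4. If $\hat L_2, L_2^*\in G(D,d)$ satisfy $\hat L_2 \ne L_2^*$, $\theta_{d^*}(\hat L_2,L_1^*),\ \theta_{d^*}(L_2^*,L_1^*) < \min_{3\le i\le K}\theta_{d^*}(L_i^*,L_1^*)$, $P_{L_1^*}(P_{\hat L_2} - P_{L_2^*})P_{L_1^*} = \mathbf{0}$ and $P_{L_1^*}^{\perp}(P_{\hat L_2} - P_{L_2^*})P_{L_1^*}^{\perp} = \mathbf{0}$,
then either $\hat L_2$ or $L_2^*$ will not satisfy the condition in (60).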

To conclude (60), we rewrite it as follows: $\gamma_{D,d}(A\,|\,B) = 0$, where $A$ and $B$
are clear from the context. We note that Lemma 4.3 implies that there are infinitely
many subspaces $L_2^*$ in $B$. On the other hand, Lemma 4.4 implies that
there is only one subspace $L_2^*$ in $A$. These observations clearly prove (60).
We remark that the idea of this proof is somewhat similar to that of the
previous case where $d = 1$ or $d = D - 1$. In this case, Lemma 4.3 is analogous
to the fact that there is a degree of freedom in choosing $L_2^*$ in (53) [since we
can choose any $\theta_{d^*}(L_1^*,L_2^*) < \min_{3\le i\le K}\theta_{d^*}(L_1^*,L_i^*)$]. Moreover, Lemma 4.4 is
analogous to the fact that there were not two subspaces $L_2^*$ and $\hat L_2$ satisfying
the underlying condition of (53).

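4.5.3. Proof of Lemma 4.3. We denote $\tilde L_1 = L_1^*\ominus(L_1^*\cap L_2^*)$ and $\tilde L_2 = L_2^*\ominus(L_1^*\cap L_2^*)$. The idea of the proof is to construct a one-to-one function
$g : S^{D-1}\cap\tilde L_2 \to Z$. Then, using this function and the fact that $\dim(\tilde L_2) = \dim(L_2^*) - \dim(L_2^*\cap L_1^*) \ge 2$, we conclude that $Z$, which contains $g(S^{D-1}\cap\tilde L_2)$, is infinite.

For any $\mathbf{u}_0\in S^{D-1}\cap\tilde L_2$, we arbitrarily fix $\mathbf{v}_0 = \mathbf{v}_0(\mathbf{u}_0)$ as one of the two
unit vectors spanning $L_1^*\cap(L_2^*\ominus\operatorname{Sp}(\mathbf{u}_0))^{\perp}$. The vector $\mathbf{v}_0$ exists since

\[
\dim\bigl(L_1^*\cap(L_2^*\ominus\operatorname{Sp}(\mathbf{u}_0))^{\perp}\bigr) \ge \dim(L_1^*) + \dim\bigl((L_2^*\ominus\operatorname{Sp}(\mathbf{u}_0))^{\perp}\bigr) - D = d + (D - d + 1) - D = 1.
\]

We define the function $g$ as follows:

\[
g(\mathbf{u}_0) = \operatorname{Sp}\bigl(\mathbf{u}_0 - 2(\mathbf{v}_0^T\mathbf{u}_0)\mathbf{v}_0,\ L_2^*\ominus\operatorname{Sp}(\mathbf{u}_0)\bigr).
\]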

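We first claim that the image of $g$ is contained in $Z$. Indeed, we note that

\[
\begin{aligned}
P_{g(\mathbf{u}_0)} - P_{L_2^*} &= (\mathbf{u}_0 - 2(\mathbf{v}_0^T\mathbf{u}_0)\mathbf{v}_0)(\mathbf{u}_0 - 2(\mathbf{v}_0^T\mathbf{u}_0)\mathbf{v}_0)^T - \mathbf{u}_0\mathbf{u}_0^T\\
&= -2(\mathbf{v}_0^T\mathbf{u}_0)\bigl(\mathbf{v}_0(\mathbf{u}_0 - (\mathbf{v}_0^T\mathbf{u}_0)\mathbf{v}_0)^T + (\mathbf{u}_0 - (\mathbf{v}_0^T\mathbf{u}_0)\mathbf{v}_0)\mathbf{v}_0^T\bigr).
\end{aligned} \tag{61}
\]

Combining (61) with the following two facts: $\mathbf{v}_0\in L_1^*$ and $\mathbf{u}_0 - (\mathbf{v}_0^T\mathbf{u}_0)\mathbf{v}_0\in L_1^{*\perp}$, we obtain that $g(\mathbf{u}_0)\in Z$.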
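
The algebraic identity (61) can be verified numerically. The sketch below is an illustration under assumptions of ours: it draws two random 2-dimensional subspaces of $\mathbb{R}^5$ (stand-ins for $L_1^*$ and $L_2^*$), builds $\mathbf{v}_0$ from the null space of a small matrix, and compares both sides of (61). It checks only the identity (61), not the two facts used afterwards to place $g(\mathbf{u}_0)$ in $Z$; all variable names are hypothetical.

```python
import numpy as np

def projector(basis):
    """Orthogonal projector onto the column span of `basis`."""
    q, _ = np.linalg.qr(basis)
    return q @ q.T

rng = np.random.default_rng(0)
D, d = 5, 2
B1 = np.linalg.qr(rng.standard_normal((D, d)))[0]   # orthonormal basis, stand-in for L1*
B2 = np.linalg.qr(rng.standard_normal((D, d)))[0]   # orthonormal basis, stand-in for L2*

u0 = B2[:, 0]        # unit vector in L2*
W = B2[:, 1:]        # orthonormal basis of L2* with Sp(u0) removed

# v0: unit vector of L1* orthogonal to W, taken from the null space of W^T B1.
_, _, vh = np.linalg.svd(W.T @ B1)
v0 = B1 @ vh[-1]
c = v0 @ u0

a = u0 - 2 * c * v0                          # the reflected vector in the definition of g(u0)
P_g = projector(np.column_stack([a, W]))     # projector onto g(u0)
P_2 = projector(B2)                          # projector onto L2*

w = u0 - c * v0
rhs = -2 * c * (np.outer(v0, w) + np.outer(w, v0))
identity_61_holds = np.allclose(P_g - P_2, rhs)
```

The check relies on $\mathbf{u}_0 - 2(\mathbf{v}_0^T\mathbf{u}_0)\mathbf{v}_0$ being a unit vector orthogonal to $L_2^*\ominus\operatorname{Sp}(\mathbf{u}_0)$, which holds because $\mathbf{v}_0$ is orthogonal to that subspace.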

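At last, we prove that $g$ is one-to-one and thus conclude the proof. If, on
the contrary, there exist $\mathbf{u}_1,\mathbf{u}_2\in S^{D-1}\cap\tilde L_2$ such that $\mathbf{u}_1\ne\mathbf{u}_2$ and $g(\mathbf{u}_1) = g(\mathbf{u}_2)$, then $g(\mathbf{u}_1) = \operatorname{Sp}(g(\mathbf{u}_1),g(\mathbf{u}_2)) \supseteq (L_2^*\ominus\operatorname{Sp}(\mathbf{u}_1)) + (L_2^*\ominus\operatorname{Sp}(\mathbf{u}_2)) \supseteq L_2^*$.
Since $\dim(g(\mathbf{u}_1)) = \dim(L_2^*)$, we conclude that $g(\mathbf{u}_1) = L_2^*$. On the other
hand, we claim that for any $\mathbf{u}_0\in S^{D-1}\cap\tilde L_2$: $g(\mathbf{u}_0)\ne L_2^*$ and thus obtain
a contradiction. Indeed, since $\mathbf{u}_0\in\tilde L_2$, $\mathbf{v}_0\in L_1^*$ and $L_1^*$ is not orthogonal
to $L_2^*$, we have that $\mathbf{v}_0^T\mathbf{u}_0\ne 0$ and, consequently, $\mathbf{u}_0 - (\mathbf{v}_0^T\mathbf{u}_0)\mathbf{v}_0\ne\mathbf{u}_0$. Applying
the latter observation in (61), we obtain that $P_{g(\mathbf{u}_0)}\ne P_{L_2^*}$ and, consequently,
$g(\mathbf{u}_0)\ne L_2^*$.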

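4.5.4. Proof of Lemma 4.4. We assume, on the contrary, that both $\hat L_2$
and $L_2^*$ satisfy the underlying condition of (52) and conclude a contradiction.

We arbitrarily fix here $\mathbf{x}\in\hat Y_1\setminus Y_1$ [using the notation of (41)]. We note
that $\operatorname{dist}(\mathbf{x},L_1^*) < \operatorname{dist}(\mathbf{x},\hat L_2)$ and $\operatorname{dist}(\mathbf{x},L_1^*) < \min_{3\le i\le K}\operatorname{dist}(\mathbf{x},L_i^*)$. Since
$\mathbf{x}\notin Y_1$, we have that $\operatorname{dist}(\mathbf{x},L_1^*) \ge \operatorname{dist}(\mathbf{x},L_2^*)$ and, thus,

\[
\operatorname{dist}(\mathbf{x},L_2^*) \le \operatorname{dist}(\mathbf{x},L_1^*) < \operatorname{dist}(\mathbf{x},\hat L_2). \tag{62}
\]

Consequently,

\[
\mathbf{x}^T(P_{\hat L_2} - P_{L_2^*})\mathbf{x} = \operatorname{dist}(\mathbf{x},L_2^*)^2 - \operatorname{dist}(\mathbf{x},\hat L_2)^2 < 0. \tag{63}
\]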

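We partition $P_{\hat L_2} - P_{L_2^*}$ into four parts: $P_{L_1^*}(P_{\hat L_2} - P_{L_2^*})P_{L_1^*}$, $P_{L_1^*}^{\perp}(P_{\hat L_2} - P_{L_2^*})P_{L_1^*}^{\perp}$, $P_{L_1^*}(P_{\hat L_2} - P_{L_2^*})P_{L_1^*}^{\perp}$ and $P_{L_1^*}^{\perp}(P_{\hat L_2} - P_{L_2^*})P_{L_1^*}$. The first two are zero,
and the last two are adjoint to each other; we thus only consider $P_{L_1^*}(P_{\hat L_2} - P_{L_2^*})P_{L_1^*}^{\perp}$. Let its SVD be

\[
P_{L_1^*}(P_{\hat L_2} - P_{L_2^*})P_{L_1^*}^{\perp} = U\Sigma V^T = \sum_{i=1}^{d}\sigma_i\mathbf{u}_i\mathbf{v}_i^T. \tag{64}
\]

We can express $P_{\hat L_2} - P_{L_2^*}$ using the SVD in (64) and the partition above as
follows:

\[
P_{\hat L_2} - P_{L_2^*} = \sum_{i=1}^{d}\sigma_i(\mathbf{u}_i\mathbf{v}_i^T + \mathbf{v}_i\mathbf{u}_i^T). \tag{65}
\]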

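Combining (63) and (65), we obtain that

\[
\sum_{i=1}^{d}\sigma_i\mathbf{u}_i^T\mathbf{x}\mathbf{x}^T\mathbf{v}_i = \mathbf{x}^T\Bigl(\sum_{i=1}^{d}\sigma_i(\mathbf{u}_i\mathbf{v}_i^T + \mathbf{v}_i\mathbf{u}_i^T)\Bigr)\mathbf{x}\Big/2 < 0. \tag{66}
\]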
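
The linear algebra behind (64)-(66) does not depend on the matrix being a difference of projectors: any symmetric matrix whose two diagonal blocks with respect to $L_1^*$ and $L_1^{*\perp}$ vanish is recovered from the SVD of its off-diagonal block. The following Python sketch illustrates this with a randomly generated symmetric matrix `A` standing in for the difference of the two projectors; the dimensions, the random seed and all names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 6, 2
Q = np.linalg.qr(rng.standard_normal((D, D)))[0]
B1 = Q[:, :d]                        # orthonormal basis of a d-subspace (stand-in for L1*)
P1 = B1 @ B1.T
P1_perp = np.eye(D) - P1

# Symmetric matrix whose diagonal blocks w.r.t. (L1*, L1*-perp) are zero.
M = rng.standard_normal((D, D))
off = P1 @ M @ P1_perp               # the block that (64) decomposes
A = off + off.T

# SVD of the off-diagonal block; it has rank at most d.
U, s, Vt = np.linalg.svd(off)
recon = sum(s[i] * (np.outer(U[:, i], Vt[i]) + np.outer(Vt[i], U[:, i]))
            for i in range(d))       # right-hand side of (65)

x = rng.standard_normal(D)
lhs_66 = sum(s[i] * (U[:, i] @ x) * (x @ Vt[i]) for i in range(d))
rhs_66 = x @ A @ x / 2
```

Here `recon` agrees with `A` and `lhs_66` agrees with `rhs_66` up to rounding, which is the equality part of (66); the sign in (66) comes from (63) and is not reproduced by this generic example.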

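We define a function $f : \mathbb{R}^{D\times D}\to\mathbb{R}$ such that for any $A\in\mathbb{R}^{D\times D}$: $f(A) = \sum_{i=1}^{d}\sigma_i\mathbf{u}_i^T A\mathbf{v}_i$. Using (66) and the fact that $\{\mathbf{u}_i\}_{i=1}^{d}\subset L_1^*$ and $\{\mathbf{v}_i\}_{i=1}^{d}\subset L_1^{*\perp}$,
we deduce that

\[
\begin{aligned}
f(D_{L_1^*,\mathbf{x},p}) &= \operatorname{dist}(\mathbf{x},L_1^*)^{(p-2)} f\bigl(P_{L_1^*}(\mathbf{x})P_{L_1^*}^{\perp}(\mathbf{x})^T\bigr)\\
&= \operatorname{dist}(\mathbf{x},L_1^*)^{(p-2)}\sum_{i=1}^{d}\sigma_i\mathbf{u}_i^T P_{L_1^*}(\mathbf{x})P_{L_1^*}^{\perp}(\mathbf{x})^T\mathbf{v}_i\\
&= \operatorname{dist}(\mathbf{x},L_1^*)^{(p-2)}\sum_{i=1}^{d}\sigma_i\mathbf{u}_i^T\mathbf{x}\mathbf{x}^T\mathbf{v}_i < 0.
\end{aligned} \tag{67}
\]

Similarly, for any point $\mathbf{x}\in Y_1\setminus\hat Y_1$,

\[
f(D_{L_1^*,\mathbf{x},p}) > 0. \tag{68}
\]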

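Combining (54), (67), (68), Lemma 4.2 and the linearity of $f$, we conclude
the following contradiction establishing the current lemma:

\[
\begin{aligned}
0 &= f\bigl(E_{\mu_0}(I(\mathbf{x}\in Y_1\setminus\hat Y_1)\,D_{L_1^*,\mathbf{x},p}) - E_{\mu_0}(I(\mathbf{x}\in\hat Y_1\setminus Y_1)\,D_{L_1^*,\mathbf{x},p})\bigr)\\
&= f\bigl(E_{\mu_0}(I(\mathbf{x}\in Y_1\setminus\hat Y_1)\,D_{L_1^*,\mathbf{x},p})\bigr) - f\bigl(E_{\mu_0}(I(\mathbf{x}\in\hat Y_1\setminus Y_1)\,D_{L_1^*,\mathbf{x},p})\bigr)\\
&> 0.
\end{aligned} \tag{69}
\]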

4.5.5. Remark on the sizes of $\delta_0$ and $\kappa_0$. The constants $\delta_0$ and $\kappa_0$ depend
on other parameters of the underlying weak HLM model, in particular, the
underlying subspaces $\{L_i^*\}_{i=1}^K$. For example, one can bound both $\kappa_0$ and $\delta_0$
from below by the following number:



max E At (e /p (x,L*)/(xGY J ))- T mm E M ( ejp (x,L)7(x€ Y*)) J/(4p). 

If $p > 2$, then a simpler lower bound on both $\kappa_0$ and $\delta_0$ is

\[
\frac{\max_{1\le i\le K}\bigl\|E_{\mu}\bigl(D_{L_i^*,\mathbf{x},p}\,I(\mathbf{x}\in Y_i)\bigr)\bigr\|_2^2}{p\,d\,D\,2^{p+5}}.
\]

5. Discussion. We studied the effectiveness of $l_p$ minimization for recovering
(or nearly recovering) all underlying $K$ subspaces for i.i.d. samples from
two different types of HLM distributions. In particular, we demonstrated
a phase transition phenomenon around $p = 1$.

We discuss here implications, extensions and limitations of this theory as 
well as some open directions. 

5.1. Obstacles for convex recovery of multiple subspaces. There are some 
recent methods for robust single subspace recovery by convex optimization 
(see, e.g., [6]). Such methods minimize a real-valued convex function $h$ on
a convex set $\mathbb{H}$ (e.g., set of matrices), which can be mapped onto $G(D,d)$.
However, such a minimization cannot be done for multiple subspaces. Indeed,
in that case one must minimize a multivariate function $h : \mathbb{H}^K\to\mathbb{R}$
for convex $\mathbb{H}$. Clearly, the function $h$ must be invariant to permutations
of coordinates. Let $g$ be a mapping of $\mathbb{H}$ onto $G(D,d)$. It follows from the
assumption that the minimization of $h$ leads to the underlying subspaces
$\{L_i^*\}_{i=1}^K$ and the permutation-invariance of $h$ that the set of minimizers of $h$
coincides with all permutations of $(\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_K)$, where $\mathbf{x}_i\in g^{-1}(L_i^*)$ for all
$1\le i\le K$. Since $h$ is convex, the average of the $K$ cyclic permutations of $(\mathbf{x}_1,\ldots,\mathbf{x}_K)$, that is, $(\sum_{i=1}^K\mathbf{x}_i/K,\ldots,\sum_{i=1}^K\mathbf{x}_i/K)$, is also a minimizer
of $h$. Consequently, $\sum_{i=1}^K\mathbf{x}_i/K\in g^{-1}(L_j^*)$ for all $1\le j\le K$ and, thus,
$g(\sum_{i=1}^K\mathbf{x}_i/K) = L_1^* = \cdots = L_K^*$, which is a contradiction.

Furthermore, a minimization on $G(D,d)^K$ cannot even be geodesically
convex. Indeed, the maximum of a geodesically convex function on a compact,
geodesically convex set is attained on the boundary. However, $G(D,d)^K$
is compact, geodesically convex and has no boundary, so any nonconstant function
defined on $G(D,d)^K$ is not geodesically convex.

5.2. Implications for a single subspace recovery. In [12], we discussed the 
recovery of a single subspace. Theorems 1.1 and 1.2 apply to this case when 
$K = 1$. Unlike [12] which assumed that $\mu_0$ was spherically symmetric (while
having possibly additional "outliers" along other subspaces, distributed according
to $\{\mu_i\}_{i=2}^K$), here we have a very weak requirement from $\mu_0$ (which
represents all outliers). However, here there is a strong restriction on the
fraction of outliers, $\alpha_0$, whereas in [12] there was no requirement, except for
$\alpha_0 < 1$.

5.3. Extending our theory to more general distributions. In Theorems 1.1 
and 1.2, the strict spherical symmetry of $\{\mu_i\}_{i=1}^K$ (within $\{L_i^*\}_{i=1}^K$, resp.) can
be replaced by approximate spherical symmetry of $\{\mu_i\}_{i=1}^K$. That is, for each
$1\le i\le K$ and $L_i^*$ and $\mu_i$ as before, we form a new distribution $\mu_i'$, with the
same support as $\mu_i$ such that the derivative of $\mu_i'$ w.r.t. $\mu_i$ is bounded away
from $0$ and $\infty$. We then replace $\mu_i$ with $\mu_i'$. This new setting will require
replacing $\{\alpha_i\}_{i=1}^K$ in (6)-(8) by $\{\delta_i\alpha_i\}_{i=1}^K$, where $\delta_i = \min(d\mu_i'/d\mu_i)$ for $1\le i\le K$
($\delta_i$ is the lowest value of the derivative of $\mu_i'$ w.r.t. $\mu_i$).

Furthermore, the boundedness of the support of the distributions $\{\mu_i\}_{i=0}^K$
can be weakened by assuming that these distributions are sub-Gaussian.
Indeed, this will mainly require replacing Hoeffding's inequality with [19],
Proposition 2.1.9.

5.4. Distributions resulting in counterexamples for our theory. There are 
several typical cases with settings different than above, where the underlying 
subspaces cannot be recovered by minimizing the energy (1) for all p > 0. 

The first typical example is when there is an outlier with sufficiently large 
magnitude so that the minimizer of (1) contains a subspace passing through 
this outlier, which is different than any of the underlying subspaces. Our 
setting avoids such a counterexample by requiring (6). We briefly provide 
the idea as follows: an arbitrarily large outlier in our setting of supports
within $B(\mathbf{0},1)$ means, for example, that the outlier has magnitude one and
the inliers are supported within $B(\mathbf{0},\epsilon)$, where $\epsilon$ is arbitrarily small. Therefore,
$\psi_{\mu_1}(\epsilon) = 1$, so that $\psi_{\mu_1}^{-1}((1 + (2K-1)\mu_1(\{\mathbf{0}\}))/2K) \le \psi_{\mu_1}^{-1}(1) = \epsilon$ and,
consequently, $\tau_0\lesssim\epsilon^p$. In view of (6), we control the fraction of outliers as
a function of $\epsilon^p$. In particular, for a fixed sample size and sufficiently small $\epsilon$,
no outliers are allowed by this condition.

The second example is when the distribution of outliers lies on another 
subspace, $L_0\in G(D,d)$ and $\alpha_0 > \min_{1\le i\le K}\alpha_i$, so that $L_0$ is contained in the
minimizer of (1). Our setting avoids this counterexample by assuming an 
upper bound on the percentage of outliers in terms of the minimal percentage 
of inliers [see (6)]. 

For the last example we assume for simplicity that $D = 2$, $d = 1$, $K = 2$
and underlying uniform distributions (of outliers and along the two underlying
lines) restricted to the unit disk. We further assume that the two lines
have angles $\epsilon$ and $-\epsilon$ w.r.t. the x-axis. By choosing $\epsilon$ sufficiently small the
x-axis and y-axis provide a smaller value for the energy (1) than the underlying
lines. We note that in this case (6) does not hold [due to the small size
of $\operatorname{dist}_G(L_1^*,L_2^*)$].
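
The last example can be reproduced numerically. The following Python sketch evaluates the population version of the energy (1) for two candidate lines by midpoint quadrature. The mixture weights (`alpha0` for the outliers and equal weights for the two lines), the grid sizes, the parameter values and the function name are assumptions chosen for the illustration; the paper does not specify them.

```python
import numpy as np

def lp_energy_two_lines(angles, eps, p, alpha0, n_phi=1440, n_r=200):
    """Population l_p energy (1) of two candidate lines through the origin in R^2.

    Assumed model: outliers uniform on the unit disk with weight alpha0;
    inliers uniform on the lines at angles +eps and -eps (restricted to the
    disk), each with weight (1 - alpha0) / 2.
    """
    def dist_min(px, py):
        # Distance of each point to the nearer of the two candidate lines.
        ds = [np.abs(-np.sin(a) * px + np.cos(a) * py) for a in angles]
        return np.minimum(ds[0], ds[1])

    # Outliers: midpoint rule in polar coordinates, density 1/pi on the disk.
    phi = (np.arange(n_phi) + 0.5) * 2 * np.pi / n_phi
    r = (np.arange(n_r) + 0.5) / n_r
    R, PHI = np.meshgrid(r, phi)
    vals = dist_min(R * np.cos(PHI), R * np.sin(PHI)) ** p
    outlier = np.sum(vals * R) * (1.0 / n_r) * (2 * np.pi / n_phi) / np.pi

    # Inliers: midpoint rule for the signed position t in [-1, 1] on each line.
    t = (np.arange(2 * n_r) + 0.5) / n_r - 1.0
    inlier = 0.0
    for a in (eps, -eps):
        inlier += 0.5 * np.mean(dist_min(t * np.cos(a), t * np.sin(a)) ** p)

    return alpha0 * outlier + (1 - alpha0) * inlier

eps, p, alpha0 = 0.01, 1.0, 0.2
e_true = lp_energy_two_lines((eps, -eps), eps, p, alpha0)      # underlying lines
e_axes = lp_energy_two_lines((0.0, np.pi / 2), eps, p, alpha0)  # x-axis and y-axis
```

For these assumed parameters `e_axes` is smaller than `e_true`: the two nearly coincident lines cover the outliers about as poorly as a single line does, whereas the two axes cover them much better at a small cost on the inliers.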

5.5. Another phase transition at $p = 1$: Many local minima for $0 < p < 1$.
Our previous work [12], proof of Proposition 2.1, implies that if $0 < p < 1$ and
there exist distinct subspaces $\{L_i\}_{i=1}^K\subset G(D,d)$ such that $\operatorname{Sp}(\mathcal{X}\cap L_i) = L_i$
for all $1\le i\le K$, then $\{L_i\}_{i=1}^K$ is a local minimizer of the energy (1). We note
that many subspaces satisfy this condition (in particular, w.o.p. $d$-subspaces
spanned by randomly sampled $d$ vectors). Therefore, $l_p$ minimization for
multiple subspaces with $0 < p < 1$ will often lead to plenty of local minima.

This wealth of local minima clearly does not occur when $p = 1$ (or $p > 1$).
It will be interesting, though difficult, to carefully analyze the number and
depth of local minima for $p > 1$.

5.6. The case of affine subspaces. Our analysis was restricted to linear
subspaces, though we believe that it can be extended to affine subspaces.
Indeed, we can consider the affine Grassmannian [15], which distinguishes
between subspaces according to both their offsets with respect to the origin
(i.e., distances to closest linear subspaces of the same dimension) and their
orientations (based on principal angles of the shifted linear subspaces). By
assuming only affine subspaces intersecting a fixed ball, we can have a compact
space. We can also generalize (70) (with a different function $\psi_{\mu_1}$) and
the estimates on $\delta_0$ and $\kappa_0$ in Section 4.5.5 to the case of affine subspaces.
We remark, though, that it is not obvious whether the metric on the affine
Grassmannian is relevant for our applications, since it mixes two different
quantities of different units (i.e., offset values and orientations) so that one
can arbitrarily weigh their contributions. Also, the common strategy of using
homogeneous coordinates which transform $d$-dimensional affine subspaces
in $\mathbb{R}^D$ to $(d+1)$-dimensional linear subspaces in $\mathbb{R}^{D+1}$ is not useful to us
since it distorts the structure of both noise and outliers.

The minimization of the energy (1) over affine subspaces seems to result 
in more local minima than in the linear case, which can partially explain 
why numerical heuristics for minimizing (1) do not perform as well with 
affine subspaces as they do with linear ones. We are interested in further 
explanation of this phenomenon. 

5.7. The case of mixed dimensions. It will be interesting to try to extend
our analysis to linear subspaces of mixed dimensions $d_1,\ldots,d_K$, known
in advance. We believe that it is possible to extend Theorem 1.1 and its
proof to this case. For this purpose, we suggest using the same distance for
subspaces of the same dimension and defining the distance $\operatorname{dist}_G(L_1,L_2)$
between linear subspaces $L_1$ and $L_2$ of different dimensions (with some
abuse of notation) as follows: if $\dim(L_1) < \dim(L_2)$, then

\[
\operatorname{dist}_G(L_1,L_2) = \min_{L\subset L_2,\ \dim(L)=\dim(L_1)}\operatorname{dist}_G(L_1,L).
\]
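
The proposed distance for mixed dimensions can be computed from principal angles. The sketch below assumes, for illustration only, that $\operatorname{dist}_G$ is the Euclidean norm of the vector of principal angles; the paper's own definition of $\operatorname{dist}_G$ is given earlier and is not restated here. Under that assumption the minimum over $L\subset L_2$ is attained by the span of the principal vectors of $L_2$ paired with $L_1$, so that the distance is the norm of the $\dim(L_1)$ principal angles between $L_1$ and $L_2$. The function names are hypothetical.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles between the column spans of A and B (dimensions may differ)."""
    qa, _ = np.linalg.qr(A)
    qb, _ = np.linalg.qr(B)
    # Singular values of qa^T qb are the cosines of the principal angles.
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

def dist_mixed(A, B):
    """Norm of the min(dim A, dim B) principal angles (assumed form of dist_G)."""
    return np.linalg.norm(principal_angles(A, B))

plane = np.eye(3)[:, :2]                           # span(e1, e2)
line_inside = np.array([[1.0], [0.0], [0.0]])      # e1, contained in the plane
line_normal = np.array([[0.0], [0.0], [1.0]])      # e3, orthogonal to the plane
line_tilted = np.array([[1.0], [0.0], [1.0]])      # at 45 degrees to the plane

d_inside = dist_mixed(line_inside, plane)
d_normal = dist_mixed(line_normal, plane)
d_tilted = dist_mixed(line_tilted, plane)
```

A line contained in the plane is at distance zero, a line orthogonal to it at distance $\pi/2$, and the tilted line at distance $\pi/4$.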

5.8. Further performance guarantees for $l_p$-based HLM algorithms. We
are interested in extending our theory to analyze heuristics (like the $K$-subspaces)
which try to minimize the $l_p$ energy of (1) in practice.
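
For concreteness, a minimal sketch of a $K$-subspaces iteration is given below. It covers only the case $p = 2$, where the update step for each cluster is an uncentered PCA; variants for $p \ne 2$ need a different update and are not shown. The initialization, the stopping rule (a fixed number of iterations), the synthetic data and all names are assumptions of this sketch, not a description of any specific published implementation.

```python
import numpy as np

def k_subspaces(X, K, d, n_iter=20, seed=0):
    """Alternate nearest-subspace assignment and per-cluster PCA (p = 2 only)."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    # Random orthonormal bases of K linear d-subspaces as a starting point.
    bases = [np.linalg.qr(rng.standard_normal((D, d)))[0] for _ in range(K)]
    energies = []
    for _ in range(n_iter):
        # Squared distance of every point to every subspace: |x|^2 - |B^T x|^2.
        sq = np.stack([np.sum(X**2, axis=1) - np.sum((X @ B)**2, axis=1)
                       for B in bases], axis=1)
        sq = np.maximum(sq, 0.0)
        labels = np.argmin(sq, axis=1)
        energies.append(np.sum(sq[np.arange(len(X)), labels]))
        for k in range(K):
            pts = X[labels == k]
            if len(pts) >= d:
                # Top-d right singular vectors minimize the summed squared distances.
                _, _, vt = np.linalg.svd(pts, full_matrices=False)
                bases[k] = vt[:d].T
    return bases, labels, np.array(energies)

# Synthetic data: 50 points on each of two lines in R^3 (no noise, no outliers).
rng = np.random.default_rng(1)
t = rng.uniform(-1, 1, size=(100, 1))
X = np.vstack([t[:50] * np.array([1.0, 0.0, 0.0]),
               t[50:] * np.array([0.0, 1.0, 1.0]) / np.sqrt(2)])
bases, labels, energies = k_subspaces(X, K=2, d=1)
```

Each of the two steps cannot increase the $l_2$ energy, so the recorded `energies` are non-increasing; the iteration may nevertheless stop at a local minimum, which is the kind of behavior such an analysis would have to address.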

5.9. Asymptotic rates of convergence and sample complexity. In Section 3.2
we demonstrated simple instances when noise is present and one
cannot asymptotically recover the underlying subspaces by $l_p$ minimization
for all $p > 0$. One may still inquire about the existence of an asymptotic limit
different than the underlying subspaces and quantify the rate of convergence
(depending on the mixture model parameters) to that limit. That is, assume
that $\{L_1,L_2\}$ is the minimizer of $E_{\mu}(e_{l_p}(\mathbf{x},L_1,L_2))$ and $\{L_1^N,L_2^N\}$ is the minimizer
of $E_{\mu_N}(e_{l_p}(\mathbf{x},L_1,L_2))$, where $\mu_N$ is an empirical distribution of i.i.d.
sample of $N$ points from $\mu$. We first ask whether $\operatorname{dist}(\{L_1,L_2\},\{L_1^N,L_2^N\})\to 0$
as $N\to\infty$. If true, then we ask about the asymptotic rates of convergence.
This will then allow a definition of a sample complexity for multiple
subspaces as the number of samples required to achieve a prediction error
within $\epsilon$ of the exact recovery of the $K$ $d$-subspaces.

APPENDIX: SUPPLEMENTARY DETAILS 

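A.1. Proof of Lemma 2.1. We will use the following inequality for any
$1\le j\le K$, which is proved in [12], Section A.1.1:

\[
\mu_1\bigl(\mathbf{x}\in B(\mathbf{0},1)\cap L_1^* : \operatorname{dist}(\mathbf{x},L_j) < \beta\operatorname{dist}_G(L_1^*,L_j)\bigr) \le \psi_{\mu_1}\Bigl(\frac{\pi\sqrt{d}\,\beta}{2}\Bigr)\qquad\forall\,\beta > 0. \tag{70}
\]

We denote $\beta_1 = \frac{2}{\pi\sqrt{d}}\,\psi_{\mu_1}^{-1}\bigl(\frac{1+(2K-1)\mu_1(\{\mathbf{0}\})}{2K}\bigr)$ (the existence of $\psi_{\mu_1}^{-1}\bigl(\frac{1+(2K-1)\mu_1(\{\mathbf{0}\})}{2K}\bigr)$
follows the same proof as in [12], Section A.1.1) and combine (70) with the
fact that $\operatorname{dist}_G(L_1^*,L_i) > \epsilon$ for any $1\le i\le K$ to obtain that

\[
\begin{aligned}
&\mu_1\bigl(\mathbf{x}\in B(\mathbf{0},1)\cap L_1^*\setminus\{\mathbf{0}\} : \operatorname{dist}(\mathbf{x},L_i) < \beta_1\epsilon\bigr)\\
&\quad\le \mu_1\bigl(\mathbf{x}\in B(\mathbf{0},1)\cap L_1^*\setminus\{\mathbf{0}\} : \operatorname{dist}(\mathbf{x},L_i) < \beta_1\operatorname{dist}_G(L_1^*,L_i)\bigr)\\
&\quad\le \frac{1-\mu_1(\{\mathbf{0}\})}{2K}.
\end{aligned}
\]

Consequently,

\[
\begin{aligned}
&\mu_1\Bigl(\mathbf{x}\in B(\mathbf{0},1)\cap L_1^* : \operatorname{dist}\Bigl(\mathbf{x},\bigcup_{i=1}^K L_i\Bigr) \ge \beta_1\epsilon\Bigr)\\
&\quad\ge 1 - \mu_1(\{\mathbf{0}\}) - \sum_{i=1}^K\mu_1\bigl(\mathbf{x}\in B(\mathbf{0},1)\cap L_1^*\setminus\{\mathbf{0}\} : \operatorname{dist}(\mathbf{x},L_i) < \beta_1\epsilon\bigr)\\
&\quad\ge (1-\mu_1(\{\mathbf{0}\}))/2,
\end{aligned}
\]

and, thus, by Chebyshev's inequality the lemma is concluded as follows: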

_ (1 - M1 ({0}))2^ 1 ^ 1 1 ((1 + (2K - l) Ml ({0}))/(2Jf)^ 



(vr\/d) ? 



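A.2. Proof of Proposition 3.1. The proof is an immediate consequence
of the following inequality, which uses an arbitrary $L_1\in G(D,d)$ and the
notation $Y_i' = Y_i(L_1',\ldots,L_K')$, $1\le i\le K$:

\[
\begin{aligned}
0 &\le E_{\nu}\bigl(e_{l_p}(\mathbf{x},L_1,L_2',\ldots,L_K')\bigr) - E_{\nu}\bigl(e_{l_p}(\mathbf{x},L_1',\ldots,L_K')\bigr)\\
&\le E_{\nu}\bigl(I(\mathbf{x}\in Y_1')\,e_{l_p}(\mathbf{x},L_1)\bigr) + \sum_{2\le i\le K}E_{\nu}\bigl(I(\mathbf{x}\in Y_i')\,e_{l_p}(\mathbf{x},L_i')\bigr) - \sum_{1\le i\le K}E_{\nu}\bigl(I(\mathbf{x}\in Y_i')\,e_{l_p}(\mathbf{x},L_i')\bigr)\\
&= E_{\nu}\bigl(I(\mathbf{x}\in Y_1')\,e_{l_p}(\mathbf{x},L_1)\bigr) - E_{\nu}\bigl(I(\mathbf{x}\in Y_1')\,e_{l_p}(\mathbf{x},L_1')\bigr).
\end{aligned}
\]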

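A.3. Proof of Lemma 4.2: Geometric sensitivity. We will first show that
there exists $\mathbf{x}_0\in B(\mathbf{0},1)$ such that

\[
\operatorname{dist}(\mathbf{x}_0,L_1^*) = \operatorname{dist}(\mathbf{x}_0,\hat L_2) < \min_{3\le i\le K}\operatorname{dist}(\mathbf{x}_0,L_i^*). \tag{71}
\]

We verify (71) in two cases: $d^* = d$ and $d^* = D - d$. We will then prove
that (71) implies (47). Throughout the proof we denote the principal vectors
of $\hat L_2$ and $L_1^*$ by $\{\hat{\mathbf{v}}_i\}_{i=1}^{d^*}$ and $\{\mathbf{v}_i\}_{i=1}^{d^*}$, respectively.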

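A.3.1. Part I: Proof of (71) when $d^* = d$. We define

\[
\mathbf{x}_0 = (\mathbf{v}_{d^*} + \hat{\mathbf{v}}_{d^*})/\|\mathbf{v}_{d^*} + \hat{\mathbf{v}}_{d^*}\|
\]

and arbitrarily fix $i_0\ge 3$ and $\mathbf{v}_0\in L_{i_0}^*$. We will show that

\[
\operatorname{ang}(\mathbf{x}_0,\mathbf{v}_0) > \theta_{d^*}(\hat L_2,L_1^*)/2 \tag{72}
\]

and consequently conclude (71) as follows:

\[
\operatorname{dist}(\mathbf{x}_0,L_{i_0}^*) \ge \sin(\operatorname{ang}(\mathbf{x}_0,\mathbf{v}_0)) > \sin(\theta_{d^*}(\hat L_2,L_1^*)/2) = \operatorname{dist}(\mathbf{x}_0,L_1^*) = \operatorname{dist}(\mathbf{x}_0,\hat L_2).
\]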
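
The last two equalities in the display above rest on an elementary fact about principal vectors: the normalized bisector of a pair of principal vectors with principal angle $\theta_i$ lies at distance $\sin(\theta_i/2)$ from both subspaces. The following Python sketch checks this for every principal pair of two randomly drawn subspaces, so it does not depend on which index $d^*$ refers to; the dimensions, the seed and the names are assumptions of the illustration.

```python
import numpy as np

def bisector_distances(B1, B2):
    """For each principal pair, return (angle, dist to span B1, dist to span B2)
    of the normalized bisector of the two principal vectors."""
    # SVD of B1^T B2: singular values are the cosines of the principal angles,
    # B1 @ U[:, i] and B2 @ Vt[i] are the corresponding principal vectors.
    U, cosines, Vt = np.linalg.svd(B1.T @ B2)
    out = []
    for i, c in enumerate(cosines):
        theta = np.arccos(np.clip(c, -1.0, 1.0))
        v, v_hat = B1 @ U[:, i], B2 @ Vt[i]
        x0 = (v + v_hat) / np.linalg.norm(v + v_hat)
        d1 = np.linalg.norm(x0 - B1 @ (B1.T @ x0))
        d2 = np.linalg.norm(x0 - B2 @ (B2.T @ x0))
        out.append((theta, d1, d2))
    return out

rng = np.random.default_rng(2)
D, d = 6, 2
B1 = np.linalg.qr(rng.standard_normal((D, d)))[0]   # orthonormal basis, stand-in for L1*
B2 = np.linalg.qr(rng.standard_normal((D, d)))[0]   # orthonormal basis, stand-in for hat L2
results = bisector_distances(B1, B2)
```

For each tuple in `results`, both distances equal $\sin(\theta/2)$ up to rounding, which follows from $P_{L_1^*}\hat{\mathbf{v}}_i = \cos(\theta_i)\,\mathbf{v}_i$.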

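We can easily verify a weaker version of (72) where the inequality is not
necessarily strict. Indeed, using elementary geometric estimates and the fact
that the intersections of the $d$-subspaces $\{L_i^*\}_{i=1}^K$ are empty [which follows
from (45)], we obtain that

\[
\begin{aligned}
\operatorname{ang}(\mathbf{x}_0,\mathbf{v}_0) &\ge \operatorname{ang}(\mathbf{v}_{d^*},\mathbf{v}_0) - \operatorname{ang}(\mathbf{v}_{d^*},\mathbf{x}_0) \ge \theta_{d^*}(L_{i_0}^*,L_1^*) - \theta_{d^*}(\hat L_2,L_1^*)/2\\
&\ge \theta_{d^*}(\hat L_2,L_1^*) - \theta_{d^*}(\hat L_2,L_1^*)/2 = \theta_{d^*}(\hat L_2,L_1^*)/2.
\end{aligned} \tag{73}
\]

At last, we show that (73) cannot be an equality. Indeed, if the first
inequality in (73) is an equality, then $\mathbf{v}_0$, $\mathbf{v}_{d^*}$ and $\mathbf{x}_0$ are on a geodesic
line within the sphere $S^{D-1}$. Combining this with the assumption that
all other inequalities in (73) are equalities, we obtain that $\operatorname{ang}(\mathbf{x}_0,\mathbf{v}_0) = \theta_{d^*}(\hat L_2,L_1^*)/2 = \operatorname{ang}(\mathbf{x}_0,\mathbf{v}_{d^*}) = \operatorname{ang}(\mathbf{x}_0,\hat{\mathbf{v}}_{d^*})$. This implies that either $\mathbf{v}_0 = \mathbf{v}_{d^*}$ or $\mathbf{v}_0 = \hat{\mathbf{v}}_{d^*}$, which contradicts (45).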

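A.3.2. Part II: Proof of (71) when $d^* = D - d$. It follows from basic
dimension equalities of subspaces and (45) that for all $2\le i\le K$: $\dim(L_1^* + L_i^*) = D$ and $\dim(L_1^*\cap L_i^*) = 2d - D$. We denote by $K_0$ the integer in $\{0,\ldots,K\}$
such that for any $3\le i\le K_0$: $L_1^*\cap L_i^* = L_1^*\cap\hat L_2$ and for any $i > K_0$: $L_1^*\cap L_i^*\ne L_1^*\cap\hat L_2$ (the existence of $K_0$ may require reordering of the indices
of the subspaces $\{L_i^*\}_{i=3}^K$). In order to define $\mathbf{x}_0$ in the current case, we
let $\mathbf{x}_1 = (\mathbf{v}_{d^*} + \hat{\mathbf{v}}_{d^*})/\|\mathbf{v}_{d^*} + \hat{\mathbf{v}}_{d^*}\|$, $\mathbf{x}_2$ be an arbitrarily fixed unit vector in
$L_1^*\cap(\hat L_2\setminus\bigcup_{K_0<i\le K}L_i^*)$, $\epsilon_0 = \operatorname{dist}(\mathbf{x}_2,\bigcup_{K_0<i\le K}L_i^*)$ and

\[
\mathbf{x}_0 = \mathbf{x}_2/2 + \epsilon_0\mathbf{x}_1/5.
\]

We first claim that

\[
\operatorname{dist}(\mathbf{x}_0,L_1^*) = \operatorname{dist}(\mathbf{x}_0,\hat L_2) < \min_{3\le i\le K_0}\operatorname{dist}(\mathbf{x}_0,L_i^*). \tag{74}
\]

Indeed, we can remove $\bigcap_{i=1}^{K_0}L_i^*$ from the subspaces $\{L_i^*\}_{i=1}^{K_0}$ and obtain subspaces
of dimension $D - d$ intersecting each other at the origin. We can then
rewrite (74) by replacing $\{L_i^*\}_{i=1}^{K_0}$ with their reduced version and $\mathbf{x}_0$ with $\mathbf{x}_1$.
The argument of Section A.3.1 thus proves this equation.

We conclude (71) by combining (74) with the following observation:

\[
\begin{aligned}
\operatorname{dist}(\mathbf{x}_0,L_1^*) &= \epsilon_0\operatorname{dist}(\mathbf{x}_1,L_1^*)/5 \le \epsilon_0/5 < \operatorname{dist}\Bigl(\mathbf{x}_2/2,\bigcup_{K_0<j\le K}L_j^*\Bigr) - \epsilon_0/5\\
&\le \operatorname{dist}\Bigl(\mathbf{x}_2/2 + \epsilon_0\mathbf{x}_1/5,\bigcup_{K_0<j\le K}L_j^*\Bigr) = \min_{K_0<i\le K}\operatorname{dist}(\mathbf{x}_0,L_i^*).
\end{aligned} \tag{75}
\]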

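A.3.3. Part III: Deriving (47) from (71) in a simple case. We note
that (71) implies that

\[
\mathbf{x}_0\in\bigl(Y_1\cup Y_2\cup(\overline{Y}_1\cap\overline{Y}_2)\bigr)\cap\bigl(\hat Y_1\cup\hat Y_2\cup(\overline{\hat Y}_1\cap\overline{\hat Y}_2)\bigr) \tag{76}
\]

and, consequently,

\[
B(\mathbf{x}_0,\epsilon)\subset\bigl(Y_1\cup Y_2\cup(\overline{Y}_1\cap\overline{Y}_2)\bigr)\cap\bigl(\hat Y_1\cup\hat Y_2\cup(\overline{\hat Y}_1\cap\overline{\hat Y}_2)\bigr). \tag{77}
\]

We will deduce here (47) from (77) in the simpler case: $\overline{Y}_1\cap\overline{Y}_2\cap B(\mathbf{x}_0,\epsilon)\ne\overline{\hat Y}_1\cap\overline{\hat Y}_2\cap B(\mathbf{x}_0,\epsilon)$.

Using (77) and the fact that $\mathcal{L}_D(\overline{Y}_1\cap\overline{Y}_2) = 0$, we may choose $\mathbf{y}\in(\overline{\hat Y}_1\cap\overline{\hat Y}_2\cap B(\mathbf{x}_0,\epsilon))\cap(Y_1\cup Y_2)$; WLOG we assume instead of the latter condition
that $\mathbf{y}\in(\overline{\hat Y}_1\cap\overline{\hat Y}_2\cap B(\mathbf{x}_0,\epsilon))\cap Y_1$. By slightly perturbing $\mathbf{y}$ we can choose
another point $\mathbf{y}_0$ such that $\mathbf{y}_0\in\hat Y_2$ and $\mathbf{y}_0\in Y_1\setminus\hat Y_1$. It follows from the
continuity of the distance function that there exists a small $\eta > 0$ such that
$(Y_1\setminus\hat Y_1)\cup(\hat Y_1\setminus Y_1)\supseteq Y_1\setminus\hat Y_1\supseteq B(\mathbf{y}_0,\eta)$, which proves (47).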

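A.3.4. Part IV: Deriving (47) from (71) in the complementary case. At
last, we assume that $\overline{Y}_1\cap\overline{Y}_2\cap B(\mathbf{x}_0,\epsilon) = \overline{\hat Y}_1\cap\overline{\hat Y}_2\cap B(\mathbf{x}_0,\epsilon)$. We show here
that it leads to the contradiction: $\hat L_2 = L_2^*$.

We note that the sets of solutions in $B(\mathbf{x}_0,\epsilon)$ of the equations $\mathbf{x}^T(P_{L_1^*} - P_{L_2^*})\mathbf{x} = 0$ and $\mathbf{x}^T(P_{L_1^*} - P_{\hat L_2})\mathbf{x} = 0$ are $\overline{Y}_1\cap\overline{Y}_2\cap B(\mathbf{x}_0,\epsilon)$ and $\overline{\hat Y}_1\cap\overline{\hat Y}_2\cap B(\mathbf{x}_0,\epsilon)$, respectively. In view of (77), these solution sets coincide. They are
$(D-1)$-manifolds and, thus, their $(D-1)$-dimensional tangent spaces at $\mathbf{x}_0$,
that is, the hyperplanes $\{\mathbf{y} : \mathbf{x}_0^T(P_{L_1^*} - P_{L_2^*})\mathbf{y} = 0\}$ and $\{\mathbf{y} : \mathbf{x}_0^T(P_{L_1^*} - P_{\hat L_2})\mathbf{y} = 0\}$, also coincide. Consequently,
we have that $\mathbf{x}_0^T(P_{L_1^*} - P_{\hat L_2}) = t_0\,\mathbf{x}_0^T(P_{L_1^*} - P_{L_2^*})$ for some $t_0\ne 0$. Similarly,
for any $\mathbf{x}_1\in\overline{Y}_1\cap\overline{Y}_2\cap B(\mathbf{x}_0,\epsilon)$, we have $\mathbf{x}_1^T(P_{L_1^*} - P_{\hat L_2}) = t_1\,\mathbf{x}_1^T(P_{L_1^*} - P_{L_2^*})$ for some $t_1\ne 0$. We note that $t_1 = t_0$ by the following argument:
$t_1\,\mathbf{x}_1^T(P_{L_1^*} - P_{L_2^*})\mathbf{x}_0 = \mathbf{x}_1^T(P_{L_1^*} - P_{\hat L_2})\mathbf{x}_0 = t_0\,\mathbf{x}_1^T(P_{L_1^*} - P_{L_2^*})\mathbf{x}_0$. Therefore, there
exists $t\ne 0$ such that for any $\mathbf{x}_1\in\overline{Y}_1\cap\overline{Y}_2\cap B(\mathbf{x}_0,\epsilon)$,

\[
\mathbf{x}_1^T(P_{L_1^*} - P_{\hat L_2}) = t\,\mathbf{x}_1^T(P_{L_1^*} - P_{L_2^*}). \tag{78}
\]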

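Since the tangent space of $\overline{Y}_1\cap\overline{Y}_2\cap B(\mathbf{x}_0,\epsilon)$ [or, equivalently, $\mathbf{x}^T(P_{L_1^*} - P_{L_2^*})\mathbf{x} = 0$] at $\mathbf{x}_0$ has dimension $D - 1$, the subspace $L_0 = \operatorname{Sp}(\overline{Y}_1\cap\overline{Y}_2\cap B(\mathbf{x}_0,\epsilon))$ [i.e., the closure of all finite linear combinations of vectors in $\overline{Y}_1\cap\overline{Y}_2\cap B(\mathbf{x}_0,\epsilon)$] has dimension at least $D - 1$. In view of (78), $L_0$ satisfies

\[
P_{L_0}(P_{L_1^*} - P_{\hat L_2}) = t\,P_{L_0}(P_{L_1^*} - P_{L_2^*}). \tag{79}
\]

Due to the symmetry of $(P_{L_1^*} - P_{\hat L_2})$ and $(P_{L_1^*} - P_{L_2^*})$, we have the following
equivalent formulation of (79):

\[
(P_{L_1^*} - P_{\hat L_2})P_{L_0} = t\,(P_{L_1^*} - P_{L_2^*})P_{L_0}. \tag{80}
\]

Furthermore, using the fact that $(P_{L_1^*} - P_{\hat L_2})$ and $(P_{L_1^*} - P_{L_2^*})$ have trace 0,
we obtain that

\[
\begin{aligned}
\operatorname{tr}\bigl(P_{L_0^{\perp}}(P_{L_1^*} - P_{\hat L_2})P_{L_0^{\perp}}\bigr) &= -\operatorname{tr}\bigl(P_{L_0}(P_{L_1^*} - P_{\hat L_2})P_{L_0}\bigr)\\
&= -t\cdot\operatorname{tr}\bigl(P_{L_0}(P_{L_1^*} - P_{L_2^*})P_{L_0}\bigr)\\
&= t\cdot\operatorname{tr}\bigl(P_{L_0^{\perp}}(P_{L_1^*} - P_{L_2^*})P_{L_0^{\perp}}\bigr).
\end{aligned} \tag{81}
\]

Since $L_0^{\perp}$ is at most one-dimensional, (81) can be rewritten as

\[
P_{L_0^{\perp}}(P_{L_1^*} - P_{\hat L_2})P_{L_0^{\perp}} = t\cdot\bigl(P_{L_0^{\perp}}(P_{L_1^*} - P_{L_2^*})P_{L_0^{\perp}}\bigr). \tag{82}
\]

Combining (79), (80) and (82), we obtain that $(P_{L_1^*} - P_{\hat L_2}) = t(P_{L_1^*} - P_{L_2^*})$,
equivalently,

\[
P_{\hat L_2} = (1-t)P_{L_1^*} + tP_{L_2^*}. \tag{83}
\]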

We conclude the desired contradiction in two different cases. Assume
first that $t < 1$ and let $\mathbf{v}_0$ be an arbitrary unit vector in $\hat L_2$. We note that
$\mathbf{v}_0^T P_{\hat L_2}\mathbf{v}_0 = 1$ as well as $(1-t)\mathbf{v}_0^T P_{L_1^*}\mathbf{v}_0 = 1 - t\,\mathbf{v}_0^T P_{L_2^*}\mathbf{v}_0 \ge 1 - t$. Consequently,
$\mathbf{v}_0^T P_{L_1^*}\mathbf{v}_0 = 1$, that is, $\mathbf{v}_0\in L_1^*$ and, thus, we obtain the following contradiction
with (45): $L_1^* = \hat L_2$ [in view of (83), this is equivalent with $\hat L_2 = L_2^*$]. Next,
assume that $t > 1$ and, as before, $\mathbf{v}_0$ is an arbitrary unit vector in $L_2^{*\perp}$. In this
case, $\mathbf{v}_0^T P_{\hat L_2}\mathbf{v}_0 = (1-t)\mathbf{v}_0^T P_{L_1^*}\mathbf{v}_0 + t\,\mathbf{v}_0^T P_{L_2^*}\mathbf{v}_0 \le 0 + 0 = 0$. Therefore, $\mathbf{v}_0\in\hat L_2^{\perp}$
and we obtain the following contradiction with (45): $L_2^* = \hat L_2$. Equation (47)
is thus proved.